Data analysis - documentation : GERALD User Guide and FAQ
This page last changed on Dec 11, 2006 by ac.
This is a user manual/FAQ for GERALD. GERALD is the program that manages the part of the pipeline responsible for sequence alignment, statistics and visualisation of output.

Basic usage

make is a tool originally developed for, and widely used in, the compilation of large software projects from multiple source files. Each file that needs to be generated is described in terms of the files it depends upon, and make works out the order in which to build everything. These descriptions are stored in a file called a "makefile". The analysis of Solexa data may be viewed in a similar way to a compilation problem, and make was found to be a useful way of building large, complicated analyses.

However, "user friendly" is not a term that one would associate with the make syntax, to say the least, so GERALD is an attempt to shield users from the necessity of writing their own makefiles. It takes a simple text-based config file and translates it into a makefile that describes the analysis in terms of "what file depends on what". This makefile can then be used to build the analysis with GNU make by typing "make" under either Linux or Cygwin (or related operating systems).

GERALD is usually run automatically as part of an overall pipeline run. However, there will frequently be occasions where it needs to be run on its own. The standard way to run GERALD is to give it a config file of parameters, as follows:
As a result of running GERALD, a new directory named something like "GERALD_16-05-2006_user" will have been created (the date is the current date, and "user" is your computer login). If you want to rerun the analysis with different parameters, you can rerun GERALD with the new parameters. A new directory will be created, and no information will be overwritten by default.

In earlier versions of GERALD, analysis parameters were given to GERALD via the command line. An example of a command-line GERALD invocation would be:

    GERALD.pl --EXPT_DIR /data/00002887/Data/C1-36_Phoenix1.8.11_09-02-2006_user.2/Bustard1.8.11_09-02-2006_user/ --FORCE --GENOME_DIR /data/Genomes --GENOME_FILE BAC_plus_vector.fa

Command-line arguments take precedence over parameters set in the config file, so command-line parameter specification can be used to override parameters in a default config file. Parameters can also be set as environment variables. The meaning of the parameters will be explained below.

Using config files

Running the analysis

First you may wish to run 'make self_test'. This performs some basic checks on your setup, e.g. that all scripts it will need to call are present and executable. It also performs some consistency checks on the parameters in the config file. A 'make self_test' is performed as the first step of any of the builds below. After that you can kick off the analysis with e.g.
You may do a partial build of your analysis as follows:

    make TILE=s_3_0012

To build tiles 5, 10, 15, 20, ... from lanes 3 and 6:

    make TILE=s_[36]_[0-9][0-9][0-9][05]

This feature may be useful for a "sneak preview" of your results, after which a full analysis may be built as described above.

An example GERALD config file

Here is an example config file that uses most of the new features and documents most of the parameters.

    # Example config file for GERALD.pl version 1.15 and above
    # Lines preceded by a hash are comments
    # notify user at end of analysis run
    EMAIL_LIST user@solexa.com
    EMAIL_SERVER mailserver
    EMAIL_DOMAIN solexa.com
    # The email report sent at the end of the analysis run contains hyper-links
    # with the following prefix to the run-folder
    WEB_DIR_ROOT file://server.solexa.co.uk/share
    # Specify read length for experiment; this is the read length used for
    # the alignments, not the sequenced read length; consequently, its value
    # has to be less than or equal to the sequence length.
    READ_LENGTH 25
    # Ignore the first base (for example, it could be part of the primer).
    # The number of Ys has to match the value of READ_LENGTH - yes, that
    # can be awkward to enter manually.
    USE_BASES nYYYYYYYYYYYYYYYYYYYYYYYYY
    # Specify the genome reference for alignment
    GENOME_FILE BAC_plus_vector.fa
    # Specify where the genome file can be found
    GENOME_DIR /home/user/Genomes
    # Align against a genomic sample; i.e., allow alignments to arbitrary
    # positions of the provided reference.
    ANALYSIS default
    # Now some lane-specific options...
    # Only align 20 cycles for lane 7
    7:READ_LENGTH 20
    # Lanes 567 are monotemplates; i.e., the reference contains a set of tags,
    # and the alignment software is constrained to match at the first base position
    567:ANALYSIS monotemplate
    567:USE_BASES all
    56:GENOME_FILE hayward12.fa
    7:GENOME_FILE magnificent7.fa
    # Lane 8 is primers only so switch it off completely
    8:ANALYSIS none
    # filtering parameters
    # (default is '(CHASTITY>=0.6)' and 'PURE_BASES 12')
    3:QF_PARAMS (NEIGHBOUR>=5.0)&&(PURITY>=0.7)&&(X_COORD>50)
    3:PURE_BASES 1
    # Found some bad tiles, these still get aligned but
    # get excluded from coverage
    BAD_TILES s_1_0001 s_2_0003

Some additional options:

    # path to the experiment directory (if not specified on command-line)
    EXPT_DIR /data/00002887/Data/C1-36_Firecrest1.8.19_20-06-2006_user/Bustard1.8.19_20-06-2006_user
    # Output directory (if other than a new GERALD folder
    # inside the EXPT_DIR)
    OUT_DIR /home/user/Test4
    # Run command below once all 'make all' targets have been built
    POST_RUN_COMMAND /yourPath/yourCommand yourArgs

Parameters and points to note

The ANALYSIS variable

You can set a variable ANALYSIS to define what goes on in each lane.

ANALYSIS default

This aligns each read against a reference sequence using the phageAlign program. All the other analysis modes are based on this mode to some extent. A full description of the files created during an analysis is given here, but briefly:

    _align.txt file for each _seq.txt file in lane

ANALYSIS none

If you set e.g.

ANALYSIS eland

Setting e.g.

ANALYSIS monotemplate

    6:ANALYSIS monotemplate

No coverage plots will be produced since they are not relevant here. Some monotemplate-specific output is produced instead.

ANALYSIS expression

This mode is meant for experiments that measure gene expression by counting the occurrence of sequence tags. Again this makes use of phageAlign's tag mode to align the reads.
It is essentially a cut-down version of monotemplate mode that omits some output files that become unmanageably large when aligning against large numbers of templates.

ANALYSIS sequence

Setting e.g.

USE_BASES - if USE_BASES all is set, all sequenced bases will show up in this file. Otherwise only cycles which have a Y at the corresponding position in the USE_BASES string will appear.

Note that in this mode the quality values are taken directly from the base caller (from the _prb.txt files via the _qraw.txt files). In default and eland modes a _sequence.txt file is still produced, but its quality values are taken from the recalibrated quality values contained in the _qcal.txt files.

Setting parameters on a lane-by-lane basis

The previous section showed that the ANALYSIS parameter can be set on a lane-by-lane basis. Other parameters can also be set in this manner, and you will need to do this for any parameters specific to a particular lane's analysis. For a monotemplate lane you need to specify the file of templates, e.g.

The other big application of this is to set QF_PARAMS on a lane-by-lane basis

Or: you could set a shorter READ_LENGTH for an individual lane.

Note that the best way to filter out individual tiles is to set BAD_TILES to be a list of the tiles you want to filter. You can do this retrospectively as described below.

USE_BASES option

"USE_BASES odd" means use bases from odd-numbered cycles only. This is a relic from the days when we had enough time, inclination and disk space to acquire images during both the sequencing and deblocking steps.
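
The example config file notes that a USE_BASES string with one Y per aligned cycle is awkward to enter by hand. As a minimal sketch (this is not part of GERALD), a small Python helper could generate the string for you. The function names use_bases_mask and count_used_bases and the skip_first parameter are hypothetical; the only conventions assumed are the ones visible in the example config, namely an "n" for an ignored leading base and one "Y" per base used, with the number of Ys equal to READ_LENGTH.

```python
def use_bases_mask(read_length, skip_first=0):
    """Build a USE_BASES string: one 'n' per ignored leading cycle,
    followed by one 'Y' per cycle to be aligned (assumed convention)."""
    if read_length < 1:
        raise ValueError("read_length must be at least 1")
    if skip_first < 0:
        raise ValueError("skip_first must not be negative")
    return "n" * skip_first + "Y" * read_length


def count_used_bases(mask):
    """Count the Y positions in a mask; this should equal READ_LENGTH."""
    return mask.count("Y")


if __name__ == "__main__":
    # Ignore the first base, then align 25 cycles.
    mask = use_bases_mask(25, skip_first=1)
    print("READ_LENGTH", count_used_bases(mask))
    print("USE_BASES", mask)
```

The two printed lines can be pasted into a config file. Checking the count of Ys against READ_LENGTH before running GERALD may save a failed consistency check later, although what 'make self_test' actually verifies is not spelled out here.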
The --FORCE option

Without the "--FORCE" option GERALD will not create any directories or files and will only operate in a diagnostic mode, in which parameters are displayed. This option must be specified in order to modify the run-folder structure on the hard drive.

Rerunning the analysis

The config.txt file you use to run GERALD is copied to the output directory. The idea is that if you want to change parameters and rebuild the analysis you can (usually) modify the config file and do

    GERALD.pl config.txt --FORCE

If you combine this with the OUT_DIR option, you can force GERALD to overwrite an existing Makefile. This way you can modify the analysis without directly editing the Makefile.

Contaminant filtering

The contents of a file of contaminant sequences may be filtered out. The filtering attempts to do this in a rigorous way by comparing the alignment of each read against the data versus its best alignment to the contaminant sequences. A one-sequence-per-line ASCII file is expected, each sequence being at least READ_LENGTH bases in length. The name of this file should be specified in CONTAM_FILE to switch contaminant filtering on. This file is assumed to live in GENOME_DIR; otherwise you need to specify its location in CONTAM_DIR. More details on the files generated during contaminant filtering may be found here.

Setting up email reporting

One of the scripts called by GERALD sends an email report on successful completion of a run. It does this by talking directly to an SMTP server. You will need to specify such a server. See the pipeline installation page for more details.

Getting started with GERALD on Cygwin

This section is specific to Cygwin.
Other TBDs:
Some release notes for pipeline modules

This is not a full version history of every pipeline module. Whenever I get around to it (not very often probably) I will add an entry to the top of this table showing changes to the current "best version" you should be using for the important software modules.
Document generated by Confluence on Mar 09, 2007 16:11