Data analysis - documentation : GERALD User Guide and FAQ
This page last changed on Jul 25, 2008 by mzerara.
This is a user manual/FAQ for GERALD. GERALD is the program managing the part of the pipeline used to do sequence alignments, statistics and visualisation of output. For a general introduction to the usage of the analysis pipeline see Pipeline usage.

Basic usage

make is a tool originally developed for and widely used for the compilation of large software projects from multiple source files. Each file that needs to be generated is described in terms of the files it depends upon, and make figures out the order in which to build everything. These descriptions are stored in a file called a "makefile." The analysis of Genome Analyzer data may be viewed in a similar way to a compilation problem, and make was found to be a useful way of building large, complicated analyses.

GERALD is a wrapper to shield users from the necessity of writing their own makefiles. It takes a simple text-based config file (containing basic information such as the type of analysis to be performed and the reference files for a sequence alignment) and translates it into a makefile that describes the analysis in terms of "what file depends on what." This makefile can then be used to build the analysis using GNU make by typing "make".

GERALD is usually run automatically as part of an overall pipeline run (see Pipeline usage). However, there will frequently be occasions where it needs to be run on its own. The standard way to run GERALD is to give it a config file of parameters, as follows:
As a result of running GERALD, a new directory named something like "GERALD_16-08-2007_user" will have been created (the date is the current date, and "user" is your computer login).

In recent versions of the Pipeline (from version 1.1), the default compression format for the sig2.txt files is gzip (as for int.txt files). If the sig2.txt files are not compressed (old Bustard run folders, 1.0 and below), add "--compression none" to the command line.

If you want to rerun the analysis and change parameters, you can rerun GERALD with the new parameters. A new directory will be created, and no information will be overwritten by default.

In earlier versions of GERALD, analysis parameters were given to GERALD via the command line. An example of a command-line GERALD invocation would be:

GERALD.pl --EXPT_DIR /data/070813_ILMN-1_0217_FC1234/Data/C1-27_Firecrest1.9.0_23-08-2007-user/Bustard1.9.0_23-08-2007_user/ --FORCE --compression <none|gzip|bzip2> --GENOME_DIR /data/Genomes --GENOME_FILE BAC_plus_vector.fa

Command-line arguments take precedence over parameters set in the config file, so command-line parameter specification can be used to override parameters in a default config file. Parameters can also be set as environment variables. The meaning of the parameters will be explained below.

Running the analysis

After that you can kick off the analysis with e.g.
Additional make options

You may do a partial build of your analysis as follows.

make TILE=s_3_0012

To build tiles 5,10,15,20,... from lanes 3 and 6

make TILE=s_[36]_[0-9][0-9][0-9][05]

This feature may be useful for a "sneak preview" of your results, after which a full analysis may be built as described above.

An example GERALD config file

Here is an example config file that uses most of the new features and documents most of the parameters.

# Example config file for GERALD.pl
# Lines preceded by a hash are comments
# notify user at end of analysis run; this is optional
EMAIL_LIST user@illumina.com
EMAIL_SERVER mailserver
EMAIL_DOMAIN illumina.com
# The email report sent at the end of the analysis run contains hyper-links
# with the following prefix to the run-folder; this is optional
WEB_DIR_ROOT file://server.illumina.com/share
# Ignore the first and last base of the read. Each n indicates that a cycle needs to be
# ignored. Each Y indicates that the cycle is used for alignment. Wild cards are expanded
# to the full length of the read.
# For example, the first base could be ignored because it is part of the sequencing primer.
# The last base could be ignored because it is not prephasing corrected and may have higher
# error rates.
USE_BASES nY*n
# Specify the genome reference for alignment with Eland. This reference has previously been
# produced by squashGenome
ELAND_GENOME /home/user/Genomes/Eland/BAC_plus_vector/
# Specify where the genome file can be found
GENOME_DIR /home/user/Genomes
# Align against a genomic sample; i.e., allow alignments to arbitrary
# positions of the provided reference.
ANALYSIS eland
# Run user-defined command below once all 'make all' targets have been built
POST_RUN_COMMAND /yourPath/yourCommand yourArgs
# Now some lane-specific options...
# Only use 20 cycles for lane 7
7:USE_BASES nY20
# Only output sequence information for lanes 5, 6, 7. No alignments are performed.
567:ANALYSIS sequence
567:USE_BASES all
# Lane 8 is primers only so switch it off completely
8:ANALYSIS none
# filtering parameters
# (default is '(CHASTITY>=0.6)')
3:QF_PARAMS '((NEIGHBOUR>=3.0)&&(CHASTITY>=0.7)&&(X_COORD>50))'

Some more optional parameters:

# path to the experiment directory (if not specified on command-line, or auto-completed
# by goat_pipeline.py)
EXPT_DIR data/070813_ILMN-1_0217_FC1234/Data/C1-27_Firecrest1.9.0_23-08-2007-user/Bustard1.9.0_23-08-2007_user/
# Output directory (if other than a new GERALD folder
# inside the EXPT_DIR)
OUT_DIR /home/user/Test4
# Specify read length for experiment; this is the read length used for
# the alignments, not the sequenced read length; consequently, its value
# has to be less than or equal to the sequence length. This is useful to force the
# wild card expansion of USE_BASES to a predefined value
READ_LENGTH 25
# Found some bad tiles, these still get aligned but
# get excluded from coverage
BAD_TILES s_1_0001 s_2_0003

Paired end analysis options:

# Use paired end alignment mode of Eland
ANALYSIS eland_pair
# Use all bases of first read, ignore first and last base of second read
USE_BASES Y*,nY*n
# Ignore first base on both first and second read, use 25 bases each and ignore any other bases
6:USE_BASES nY25

Parameters and points to note

The ANALYSIS variable

You can set a variable ANALYSIS to define what goes on in each lane.

ANALYSIS eland

Setting e.g.

ANALYSIS sequence

Setting e.g. USE_BASES - see below.

ANALYSIS default

This aligns each read against a reference sequence using the phageAlign program. All the other analysis modes are based on this mode to some extent. A full description of the files created during an analysis is given here, but briefly:

_align.txt file for each _seq.txt file in lane

The phageAlign program

ANALYSIS none

If you set e.g.

ANALYSIS monotemplate

6:ANALYSIS monotemplate

No coverage plots will be produced since they are not relevant here.
Some monotemplate-specific output is produced instead.

ANALYSIS expression (DEPRECATED)

This mode is meant for experiments that measure gene expression by counting the occurrence of sequence tags. Again this makes use of phageAlign's tag mode to align the reads. It is essentially a cut-down version of monotemplate mode that omits some output files that become unmanageably large when aligning against large numbers of templates.

Note that in this mode the quality values are taken directly from the base caller (from the _prb.txt files via the _qraw.txt files). In default and eland modes a _sequence.txt file is still produced, but its quality values are taken from the recalibrated quality values contained in the _qcal.txt files.
ANALYSIS eland_extended

This is intended to supersede the ANALYSIS eland mode. However, the ANALYSIS eland mode will be kept as it is to allow customers to migrate at their leisure.
ANALYSIS eland_pair

This mode allows alignment of paired reads and is based heavily on ANALYSIS eland_extended. A single-read alignment is done for each half of the pair, and then the best-scoring alignments are compared to find the best paired-read alignment.

USE_BASES option

The USE_BASES option is used to identify which bases of a full read produced by a sequencing run should be used for the alignment analysis. Usually USE_BASES consists of a string, one character per sequencing cycle, to identify how each cycle should be handled. Each character in the string identifies whether the corresponding cycle should be aligned. The following notation is used:
Best illustrated by examples:
Special values:
A common pitfall here:
Setting parameters on a lane-by-lane basis

The previous section showed that the ANALYSIS parameter can be set on a lane-by-lane basis. Other parameters can also be set in this manner, and you will need to do this for any parameters specific to a particular lane's analysis. For a monotemplate lane you need to specify the file of templates, e.g.

The other big application of this is to set QF_PARAMS on a lane-by-lane basis.

Note that the best way to filter out individual tiles is to set BAD_TILES to be a list of the tiles you want to filter. You can do this retrospectively as described below.

The --FORCE option

Without the "--FORCE" option GERALD will not create any directories or files and will only operate in a diagnostic mode, in which parameters are displayed. To modify the run-folder structure on the hard drive, this option must be specified.

Rerunning the analysis

The config.txt file you use to run GERALD is copied to the output directory. The idea is that if you want to change parameters and rebuild the analysis you can (usually) modify the config file and do

GERALD.pl config.txt --FORCE

If you combine this with the OUT_DIR option, you can force GERALD to overwrite an existing Makefile. This way you can then modify the analysis without directly editing the Makefile.

Contaminant filtering

The contents of a file of contaminant sequences may be filtered out. The pipeline attempts to do this in a rigorous way by comparing the alignment of each read against the data versus its best alignment to the contaminant sequences. A one-sequence-per-line ASCII file is expected, each sequence being at least READ_LENGTH bases in length. The name of this file should be specified in CONTAM_FILE to switch contaminant filtering on. This file is assumed to live in GENOME_DIR; otherwise you need to specify its location in CONTAM_DIR. More details on the files generated during contaminant filtering may be found here.
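The format expected for the contaminant file (one sequence per line, each at least READ_LENGTH bases long) can be checked before starting a run. The following is a minimal Python sketch of such a check and is not part of the pipeline. The function name check_contam_file is hypothetical, and skipping blank lines is an assumption rather than documented GERALD behaviour.

```python
def check_contam_file(path, read_length):
    """Return (line_number, sequence) pairs that are shorter than read_length.

    Hypothetical helper, not a pipeline tool. Assumes one sequence per line
    in an ASCII file, as described for CONTAM_FILE.
    """
    too_short = []
    with open(path, encoding="ascii") as handle:
        for line_number, line in enumerate(handle, start=1):
            sequence = line.strip()
            if not sequence:
                continue  # assumption: blank lines are not sequences
            if len(sequence) < read_length:
                too_short.append((line_number, sequence))
    return too_short
```

An empty list means every sequence in the file is long enough for the READ_LENGTH value passed in; otherwise the returned line numbers point at the entries to fix or remove.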
Setting up email reporting

One of the scripts called by GERALD sends an email report on successful completion of a run. It does this by talking directly to an SMTP server. You will need to specify such a server. See the pipeline installation page for more details.

Building an SRF archive

Starting from version 1.0, the pipeline is distributed with a modified version of io_lib and allows the generation of SRF archives. This is done by adding the following line in the config.txt:

1:SRF_ARCHIVE_REQUIRED yes

Including the Quality Calibration scores in the SRF archive

The modified version of io_lib distributed with the pipeline also supports the inclusion of the quality calibration scores in the SRF archives. To do so, add the following line in the config.txt:

1:SRF_QCAL yes

Extracting the data from the archive

This is done by an io_lib utility (srf2illumina). The standard distribution of io_lib will be able to extract the data from the archive, except the quality scores. For these, the modified version distributed with the pipeline is currently required.

Notes

GERALD stands for "Generation of Recursive Analyses Linked by Dependency". An earlier version of the sequence analysis pipeline was called GOAT ("General Oligo Analysis Tool").
Document generated by Confluence on Jul 25, 2008 16:42