This page last changed on Sep 14, 2007 by maising.

This is a user manual/FAQ for GERALD. GERALD is the program managing the part of the pipeline used to do sequence alignments, statistics and visualisation of output. For a general introduction to the usage of the analysis pipeline see Pipeline usage.

Basic usage

make is a tool originally developed for, and widely used for, the compilation of large software projects from multiple source files. Each file that needs to be generated is described in terms of the files it depends upon, and make works out the order in which to build everything. These descriptions are stored in a file called a "makefile". The analysis of Genome Analyzer data may be viewed as a problem similar to compilation, and make was found to be a useful way of building large, complicated analyses. GERALD is a wrapper that shields users from the necessity of writing their own makefiles. It takes a simple text-based config file (containing basic information such as the type of analysis to be performed and the reference files for a sequence alignment) and translates it into a makefile that describes the analysis in terms of "what file depends on what". This makefile can then be used to build the analysis using GNU make by typing "make".
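The dependency idea can be illustrated outside make. The following is a minimal Python sketch, not part of GERALD, that orders a handful of output files so that each one is built after the files it depends on. The file names are hypothetical, modelled on the file types listed later in this manual, and the graphlib module requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter

# Each target maps to the set of files it depends on.
# The names are hypothetical examples, not taken from a real run.
dependencies = {
    "s_1_0001_align.txt": {"s_1_0001_seq.txt"},
    "s_1_0001_score.txt": {"s_1_0001_align.txt"},
    "s_1_0001_realign.txt": {"s_1_0001_align.txt", "s_1_0001_score.txt"},
    "s_1_cov.txt": {"s_1_0001_realign.txt"},
}


def build_order(deps):
    """Return the files in an order where every file follows its dependencies."""
    return list(TopologicalSorter(deps).static_order())


order = build_order(dependencies)
for name in order:
    print(name)
```

make performs the same kind of ordering from the rules in the makefile, and additionally skips targets that are already up to date.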

GERALD is usually run automatically as part of an overall pipeline run (see Pipeline usage). However, there will frequently be occasions where it needs to be run on its own. The standard way to run GERALD is to give it a config file of parameters, as follows:

  1. Edit a suitable config file, say "config.txt" - see section below for an example.
  2. To create a Makefile for sequence alignment, type
    GERALD.pl config.txt --FORCE

As a result of running GERALD, a new directory named something like "GERALD_16-08-2007_user" will have been created (the date is the current date, and "user" is your computer login). If you want to rerun the analysis and change parameters, you can rerun GERALD with the new parameters. A new directory will be created, and no information will be overwritten by default.
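If a script needs to predict or locate the output directory, the naming pattern can be reproduced as below. This is a small Python sketch that assumes the name is always "GERALD_", the date as day-month-year, and the login name, as in the example above; the function name is made up for illustration.

```python
import datetime


def gerald_dir_name(date, user):
    """Build a directory name following the GERALD_<DD-MM-YYYY>_<user> pattern."""
    return f"GERALD_{date:%d-%m-%Y}_{user}"


name = gerald_dir_name(datetime.date(2007, 8, 16), "user")
print(name)
```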

In earlier versions of GERALD, analysis parameters were given to GERALD via the command line. An example of a command line GERALD invocation would be:

GERALD.pl --EXPT_DIR \
  /data/070813_ILMN-1_0217_FC1234/Data/C1-27_Firecrest1.9.0_23-08-2007-user/Bustard1.9.0_23-08-2007_user/ \
  --FORCE --GENOME_DIR /data/Genomes --GENOME_FILE BAC_plus_vector.fa

Command-line arguments take precedence over parameters set in the config file, so command-line parameter specification can be used to override parameters in a default config file. Parameters can also be set as environment variables.

The meaning of the parameters will be explained below.

Running the analysis

Once GERALD has created the Makefile, you can start the analysis with, e.g.:

  • make - basic analysis
  • make -j 3 - to parallelize 3 ways. The extent of the parallelisation will depend on the setup of your computer or computing cluster. It does not make much sense to specify a number that exceeds the number of processors/nodes on your computing system. Load-sharing facilities can be exploited automatically, for example on a cluster. See Pipeline parallelisation for more information on parallelisation.
  • nohup make & - to run in background

Additional make options

You may do a partial build of your analysis as follows.
To build all files for tile 12 of lane 3:

make TILE=s_3_0012

To build tiles 5, 10, 15, 20, ... from lanes 3 and 6:

make TILE=s_[36]_[0-9][0-9][0-9][05]

This feature may be useful for a "sneak preview" of your results, after which a full analysis may be built as described above.
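To check which tiles a TILE pattern would select before starting a build, the pattern can be tested against a list of tile names. The following Python sketch assumes that the pattern behaves like a shell-style glob, as its syntax suggests, and that tiles are named s_<lane>_<four-digit tile number> as in the examples above. The lane and tile counts are made up for illustration.

```python
from fnmatch import fnmatchcase


def tile_names(lanes, tiles_per_lane):
    """Generate tile names of the form s_<lane>_<4-digit tile number>."""
    return [
        f"s_{lane}_{tile:04d}"
        for lane in lanes
        for tile in range(1, tiles_per_lane + 1)
    ]


def select_tiles(pattern, names):
    """Return the names matching a shell-style pattern (case-sensitive)."""
    return [name for name in names if fnmatchcase(name, pattern)]


# Hypothetical run: 8 lanes with 20 tiles each.
all_tiles = tile_names(lanes=range(1, 9), tiles_per_lane=20)
selected = select_tiles("s_[36]_[0-9][0-9][0-9][05]", all_tiles)
print(selected)
```

With these assumed counts, the pattern picks tiles 5, 10, 15 and 20 of lanes 3 and 6.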

An example GERALD config file

Here is an example config file that uses most of the new features and documents most of the parameters.

# Example config file for GERALD.pl
# Lines preceded by a hash are comments

# notify user at end of analysis run; this is optional
EMAIL_LIST user@illumina.com
EMAIL_SERVER mailserver
EMAIL_DOMAIN illumina.com

# The email report sent at the end of the analysis run contains hyper-links
# with the following prefix to the run-folder; this is optional
WEB_DIR_ROOT file://server.illumina.com/share

# Ignore the first and last base of the read. Each n indicates that a cycle needs to be 
# ignored. Each Y indicates that the cycle is used for alignment. Wild cards are expanded
# to the full length of the read.
# For example, the first base could be ignored because it is part of the sequencing primer.
# The last base could be ignored because it is not prephasing corrected and may have higher 
# error rates.
USE_BASES nY*n

# Specify the genome reference for alignment with Eland. This reference has previously been 
# produced by squashGenome
ELAND_GENOME /home/user/Genomes/Eland/BAC_plus_vector/

# Specify where the genome file can be found
GENOME_DIR /home/user/Genomes

# Align against a genomic sample; i.e., allow alignments to arbitrary
# positions of the provided reference.
ANALYSIS eland

# Run user-defined command below once all 'make all' targets have been built
POST_RUN_COMMAND /yourPath/yourCommand yourArgs

# Now some lane-specific options...

# Only use 20 cycles for lane 7
7:USE_BASES nY20

# Only output sequence information for lanes 5, 6, 7. No alignments are performed. 
567:ANALYSIS sequence
567:USE_BASES all
# Lane 8 is primers only so switch it off completely
8:ANALYSIS none

# filtering parameters
# (default is '(CHASTITY>=0.6)')
3:QF_PARAMS '((NEIGHBOUR>=3.0)&&(CHASTITY>=0.7)&&(X_COORD>50))'

Some more optional parameters:

# path to the experiment directory (if not specified on command-line, or auto-completed
# by goat_pipeline.py)
EXPT_DIR /data/070813_ILMN-1_0217_FC1234/Data/C1-27_Firecrest1.9.0_23-08-2007-user/Bustard1.9.0_23-08-2007_user/

# Output directory (if other than a new GERALD folder
# inside the EXPT_DIR)
OUT_DIR /home/user/Test4

# Specify read length for experiment; this is the read length used for
# the alignments, not the sequenced read length; consequently, its value
# has to be less than or equal to the sequenced read length. This is useful
# to force the wild card expansion of USE_BASES to a predefined value
READ_LENGTH 25

# Found some bad tiles, these still get aligned but
# get excluded from coverage
BAD_TILES s_1_0001 s_2_0003

Paired end analysis options:

# Use paired end alignment mode of Eland
ANALYSIS eland_pair

# Use all bases of first read, ignore first and last base of second read
USE_BASES Y*,nY*n

# Ignore first base on both first and second read, use 25 bases each and ignore any other bases
6:USE_BASES nY25

Parameters and points to note

The ANALYSIS variable

You can set a variable ANALYSIS to define what goes on in each lane.

ANALYSIS eland

Setting e.g.
6:ANALYSIS eland
runs an ELAND whole-genome analysis for lane 6. Documentation can be found in Whole genome alignments using ELAND. You will need to use ELAND if your reference sequence exceeds, say, 1 Mb in size. No coverage files will be generated.

ANALYSIS sequence

Setting e.g.
6:ANALYSIS sequence
produces a file s_6_sequence.txt. This file contains all sequences in a lane of a chip in a format that is meant to be exportable. The content of this file is affected by the following parameters.

USE_BASES - see below.
QF_PARAMS - only sequences for clusters passing the quality filtering will appear. Set QF_PARAMS '(1==1)' or similar to pass all of them.
SEQUENCE_FORMAT - allowed values here are --fasta, --fastq or --scarf. The fasta and fastq file formats are widely used. Scarf stands for Solexa Compact ASCII Read Format and is a one-line-per-read text format used at Illumina.
QUALITY_FORMAT - allowed values here are --numeric and --symbolic. Mode --numeric gives the quality values as a space separated string of numbers, --symbolic gives them as a compact string of ASCII characters. Subtract 64 from the ASCII code to get the corresponding quality values, but bear in mind that under the Solexa numbering scheme quality values can theoretically be as low as -5, which has an ASCII encoding of 59=';'.
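The relationship between the two quality formats can be sketched in a few lines of Python. This is only an illustration of the "subtract 64" rule described above; the function names are made up and the example quality string is invented.

```python
def decode_symbolic(qualities, offset=64):
    """Convert a --symbolic quality string into a list of quality values."""
    return [ord(char) - offset for char in qualities]


def to_numeric_string(values):
    """Format quality values as a space-separated string, as in --numeric mode."""
    return " ".join(str(value) for value in values)


values = decode_symbolic(";@Yh")  # invented example string
print(values)
print(to_numeric_string(values))
```

The first character of the example, ';', has ASCII code 59 and therefore decodes to the lowest value mentioned above, -5.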

ANALYSIS default

This aligns each read against a reference sequence using the phageAlign program. All the other analysis modes are based on this mode to some extent.

A full description of the files created during an analysis is given here, but briefly:

_align.txt file for each _seq.txt file in lane
_score.txt from each _align.txt - scores for filtered alignments
_realign.txt from each _align.txt and _score.txt - realignment of filtered alignments
_rescore.txt from each _realign.txt
_qalign.txt from each _prb.txt file
_cov.txt from all _realign.txt files - coverage for this lane

The phageAlign program

ANALYSIS none

If you set e.g.
8:ANALYSIS none
GERALD will ignore lane 8. If there are no files for that lane anyway, you don't need to do this.

ANALYSIS monotemplate

6:ANALYSIS monotemplate
This makes lane 6 a monotemplate analysis, where we expect each read to be one of a small number (say 20 or fewer) of known template sequences. The reads are aligned using phageAlign in tag mode, which treats each line of the reference sequence as a separate tag.

No coverage plots will be produced since they are not relevant here. Some monotemplate-specific output is produced instead.

ANALYSIS expression (DEPRECATED)

This mode is meant for experiments that measure gene expression by counting the occurrence of sequence tags. Again this makes use of phageAlign's tag mode to align the reads. It is essentially a cut-down version of monotemplate mode that omits some output files that become unmanageably large when aligning against large numbers of templates.

Note that in this mode the quality values are taken directly from the base caller (from the _prb.txt files via the _qraw.txt files). In default and eland modes a _sequence.txt file is still produced, but its quality values are taken from the recalibrated quality values contained in the _qcal.txt files.

New ANALYSIS modes in pipeline version 0.3

The ANALYSIS eland_extended and ANALYSIS eland_pair modes are new for pipeline version 0.3. They are described in much more detail here.

ANALYSIS eland_extended

This is intended to supersede the ANALYSIS eland mode. However, the ANALYSIS eland mode will be kept as it is to allow customers to migrate at their leisure.
Key benefits:

  • Better handling of >32 base reads
  • Each alignment given a confidence value based on its base quality scores
  • A single file of sorted alignments produced for each lane

ANALYSIS eland_pair

This mode allows alignment of paired reads and is based heavily on ANALYSIS eland_extended. A single-read alignment is done for each half of the pair, and then the best-scoring alignments are compared to find the best paired-read alignment.

USE_BASES option

The USE_BASES option is used to identify which bases of a full read produced by a sequencing run should be used for the alignment analysis. Usually USE_BASES consists of a string, one character per sequencing cycle, to identify how each cycle should be handled. Each character in the string identifies whether the corresponding cycle should be aligned. The following notation is used:

  • 'n' means ignore the cycle
  • 'Y' means use the cycle for the alignment
  • a comma ',' denotes a read boundary (for multiple reads)
  • an asterisk '*' means "fill up the read as far as possible with the preceding character" (as in a UNIX regular expression)
  • a number means that the previous character is to be repeated that many times

Unspecified cycles are set to 'n' by default.

Best illustrated by examples:
Examples/use cases

  1. First base ignored, bases 2-4 used
    USE_BASES nYYY
  2. Single 30-mer read
    USE_BASES Y30
  3. Single 30-mer read, first base primer
    USE_BASES nY30
  4. Single 30-mer read, first base primer, ignore last base for prephasing
    USE_BASES nY30n
  5. Single read, first base primer, ignore last base. Length of read is automatically set to (number of sequencing cycles - 2)
    USE_BASES nY*n
  6. Paired 30-mer read: first base of each read primer, 31 bases sequenced:
    USE_BASES nY*,nY*

    This is equivalent to

    USE_BASES nY*

    if the mode "ANALYSIS eland_pair" is used.

Special values:

  • "USE_BASES all" means use all bases.
  • "USE_BASES odd" means use bases from odd-numbered cycles only. This is a relic from the days when we had enough time, inclination and disk space to acquire images during both the sequencing and deblocking steps.

A common pitfall here:

  • The lower case "n" is important - this tells the pipeline to completely ignore that cycle. An upper case "N" signifies a deblock cycle, causing the pipeline to try (and fail) to produce extra deblock-related output for that cycle.
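The USE_BASES notation can be made concrete with a small expansion routine. The following Python sketch is not GERALD's own code; it is one reading of the rules above, under these assumptions: at most one '*' per read, trailing unspecified cycles become 'n', and a single specification is reused for every read (as described for "ANALYSIS eland_pair"). Only 'n', 'Y', numbers, '*', ',' and "all" are handled; the function names are made up for illustration.

```python
import re

TOKEN = re.compile(r"([nY])(\*|\d+)?")
VALID = re.compile(r"(?:[nY](?:\*|\d+)?)+")


def expand_read(spec, n_cycles):
    """Expand the USE_BASES string for one read to one character per cycle."""
    if spec == "all":
        return "Y" * n_cycles
    if not VALID.fullmatch(spec):
        raise ValueError(f"unsupported USE_BASES string: {spec!r}")
    tokens = TOKEN.findall(spec)  # list of (character, repeat) pairs
    if sum(repeat == "*" for _, repeat in tokens) > 1:
        raise ValueError("this sketch allows at most one '*' per read")
    # Cycles claimed by everything except the '*' token.
    fixed = sum(int(repeat or 1) for _, repeat in tokens if repeat != "*")
    fill = max(n_cycles - fixed, 0)
    mask = "".join(
        char * (fill if repeat == "*" else int(repeat or 1))
        for char, repeat in tokens
    )
    if len(mask) > n_cycles:
        raise ValueError(f"{spec!r} needs more than {n_cycles} cycles")
    # Trailing cycles that were not specified are ignored ('n').
    return mask.ljust(n_cycles, "n")


def expand_use_bases(spec, cycles_per_read):
    """Expand a possibly comma-separated USE_BASES string, one mask per read."""
    parts = spec.split(",")
    if len(parts) == 1:
        parts = parts * len(cycles_per_read)  # reuse for every read (assumption)
    if len(parts) != len(cycles_per_read):
        raise ValueError("number of reads does not match the specification")
    return [expand_read(part, n) for part, n in zip(parts, cycles_per_read)]


print(expand_read("nY*n", 36))
print(expand_use_bases("Y*,nY*n", [36, 36]))
```

For a 36-cycle read, "nY*n" expands to one 'n', 34 'Y' characters and a final 'n', which matches the rule that the read length is the number of sequencing cycles minus 2.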

Setting parameters on a lane-by-lane basis

The previous section showed that the ANALYSIS parameter can be set on a lane-by-lane basis. Other parameters can also be set in this manner, and you will need to do this for any parameters specific to a particular lane's analysis.

For a monotemplate lane you need to specify the file of templates, e.g.
6:GENOME_FILE mono1.txt
You can specify a parameter for multiple lanes in a single line, e.g.
67:GENOME_FILE mono1.txt
sets the parameter for lanes 6 and 7.
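The lane prefix convention can be sketched as a small parser. This Python example is an illustration only, not GERALD's own parsing code. It assumes eight lanes numbered 1 to 8, that a lane-specific setting overrides a global one, and that when two lane-specific lines set the same parameter the later line wins; none of these rules is confirmed by this manual.

```python
def parse_config(text, lanes="12345678"):
    """Return a dict mapping each lane to its effective parameter settings."""
    settings = {lane: {} for lane in lanes}
    lane_specific = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split(None, 1)
        key = parts[0]
        value = parts[1] if len(parts) > 1 else ""
        if ":" in key:
            prefix, key = key.split(":", 1)
            lane_specific.append((prefix, key, value))
        else:
            for lane in lanes:
                settings[lane][key] = value
    # Apply lane-specific values last so that they override global ones.
    for prefix, key, value in lane_specific:
        for lane in prefix:
            if lane not in settings:
                raise ValueError(f"unknown lane {lane!r} in prefix {prefix!r}")
            settings[lane][key] = value
    return settings


example = """
# global settings
ANALYSIS eland
USE_BASES nY*n
# lane-specific settings
67:ANALYSIS monotemplate
67:GENOME_FILE mono1.txt
8:ANALYSIS none
"""
settings = parse_config(example)
print(settings["6"])
print(settings["8"])
```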

The other big application of this is to set QF_PARAMS on a lane-by-lane basis,
e.g. suppose clusters near one edge of tile 4 of lane 3 looked dodgy; you could do
3:QF_PARAMS '((NEIGHBOUR>=5.0)&&(PURITY>=0.7)&&((TILE!=4)||(X_COORD>50)))'
The filter parameter can be any boolean Perl expression; the variable names are just aliases
to fields in a tab-separated text file.

Note that the best way to filter out individual tiles is to set BAD_TILES to be a list of the tiles you want to filter. You can do this retrospectively as described below.
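To see how the example filter for lane 3 behaves, its logic can be restated outside Perl. The Python sketch below re-expresses the same boolean expression as a function; the cluster records are invented for illustration and the dictionary layout is an assumption, since the real filter reads fields from a tab-separated text file.

```python
def passes_filter(cluster):
    """Python restatement of
    ((NEIGHBOUR>=5.0)&&(PURITY>=0.7)&&((TILE!=4)||(X_COORD>50)))."""
    return (
        cluster["NEIGHBOUR"] >= 5.0
        and cluster["PURITY"] >= 0.7
        and (cluster["TILE"] != 4 or cluster["X_COORD"] > 50)
    )


# Invented example clusters.
clusters = [
    {"TILE": 4, "X_COORD": 30, "NEIGHBOUR": 6.0, "PURITY": 0.9},   # tile 4, near edge
    {"TILE": 4, "X_COORD": 120, "NEIGHBOUR": 6.0, "PURITY": 0.9},  # tile 4, away from edge
    {"TILE": 7, "X_COORD": 30, "NEIGHBOUR": 6.0, "PURITY": 0.9},   # other tile, near edge
    {"TILE": 7, "X_COORD": 300, "NEIGHBOUR": 2.0, "PURITY": 0.9},  # neighbour too close
]
kept = [passes_filter(cluster) for cluster in clusters]
print(kept)
```

Only clusters on tile 4 are subject to the X_COORD cut; clusters on every other tile are judged on NEIGHBOUR and PURITY alone.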

The --FORCE option

Without the "--FORCE" option GERALD will not create any directories or files and operates only in a diagnostic mode, in which parameters are displayed. This option must be specified for GERALD to modify the run-folder structure on the hard drive.

Rerunning the analysis

The config.txt file you use to run GERALD is copied to the output directory. The idea is that if you want to change parameters and rebuild the analysis you can (usually) modify the config file and do

GERALD.pl config.txt --FORCE

If you combine this with the OUT_DIR option, you can force GERALD to overwrite an existing Makefile. This way you can then modify the analysis without directly editing the Makefile.

Contaminant filtering

Reads matching the contents of a file of contaminant sequences may be filtered out. The filtering attempts to do this in a rigorous way by comparing the alignment of each read against the data versus its best alignment to the contaminant sequences. A one-sequence-per-line ASCII file is expected, each sequence being at least READ_LENGTH bases in length.

The name of this file should be specified in CONTAM_FILE to switch contaminant filtering on. This file is assumed to live in GENOME_DIR, else you need to specify its location in CONTAM_DIR. More details on the files generated during contaminant filtering may be found here.
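Before switching contaminant filtering on, it can be worth checking that the file meets the format described above. The following Python sketch is a hypothetical helper, not part of the pipeline; it checks only the two stated requirements (ASCII content, and at least READ_LENGTH characters per line) and uses an in-memory example instead of a real file.

```python
import io


def check_contaminant_file(handle, read_length):
    """Return (line_number, message) pairs for lines that look unusable."""
    problems = []
    for number, line in enumerate(handle, start=1):
        sequence = line.rstrip("\r\n")
        if not sequence.isascii():
            problems.append((number, "non-ASCII characters"))
        elif len(sequence) < read_length:
            problems.append(
                (number, f"only {len(sequence)} bases, need {read_length}")
            )
    return problems


# Invented example: one 32-base sequence and one that is too short.
example = io.StringIO("ACGT" * 8 + "\n" + "ACGTACGT\n")
problems = check_contaminant_file(example, read_length=25)
print(problems)
```

For a real file, an open file handle can be passed in place of the io.StringIO object.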

Setting up email reporting

One of the scripts called by GERALD sends an email report on successful completion of a run. It does this by talking directly to an SMTP server. You will need to specify such a server. See the pipeline installation page for more details.

Notes

GERALD stands for "Generation of Recursive Analyses Linked by Dependency". An earlier version of the sequence analysis pipeline was called GOAT ("General Oligo Analysis Tool").

Document generated by Confluence on Dec 19, 2007 18:32