This page last changed on Dec 11, 2006 by ac.

This is a user manual/FAQ for GERALD. GERALD is the program managing the part of the pipeline used to do sequence alignments, statistics and visualisation of output.

Basic usage

make is a tool originally developed for, and widely used for, compiling large software projects from multiple source files. Each file that needs to be generated is described in terms of the files it depends upon, and make works out the order in which to build everything. These descriptions are stored in a file called a "makefile". The analysis of Solexa data may be viewed as a similar problem to compilation, and make has proved a useful way of building large, complicated analyses. However, "user friendly" is not a term that one would associate with the make syntax, to say the least, so GERALD is an attempt to shield users from the necessity of writing their own makefiles. It takes a simple text-based config file and translates it into a makefile that describes the analysis in terms of "what file depends on what". This makefile can then be used to build the analysis with GNU make by typing "make" under either Linux or Cygwin (or related operating systems).

GERALD is usually run automatically as part of an overall pipeline run. However, there will frequently be occasions when it needs to be run on its own. The standard way to run GERALD is to give it a config file of parameters, as follows:

  1. Edit a suitable config file, say "config.txt" - see section below for an example.
  2. Do either
    cat config.txt | GERALD.pl - --FORCE

    or, equivalently,

    GERALD.pl config.txt --FORCE

As a result of running GERALD, a new directory named something like "GERALD_16-05-2006_user" will have been created (the date is the current date, and "user" is your computer login). If you want to rerun the analysis and change parameters, you can rerun GERALD with the new parameters. A new directory will be created, and no information will be overwritten by default.

In earlier versions of GERALD, analysis parameters were given to GERALD via the command line. An example of a command line GERALD invocation would be:

GERALD.pl --EXPT_DIR \
  /data/00002887/Data/C1-36_Phoenix1.8.11_09-02-2006_user.2/Bustard1.8.11_09-02-2006_user/ \
  --FORCE --GENOME_DIR /data/Genomes --GENOME_FILE BAC_plus_vector.fa

Command-line arguments take precedence over parameters set in the config file, so command-line parameter specification can be used to override parameters in a default config file. Parameters can also be set as environment variables.

The meaning of the parameters will be explained below.

Using config files

Running the analysis

First you may wish to run 'make self_test'. This performs some basic checks on your setup, e.g. that all the scripts it will need to call are present and executable. It also performs some consistency checks on the parameters in the config file. A 'make self_test' is performed as the first step of any of the builds below.

After that you can kick off the analysis with e.g.

  • make - basic
  • make -j 3 - to parallelize 3 ways. The extent of the parallelisation will depend on the setup of your computer or computing cluster. It does not make much sense to specify a number that exceeds the number of processors/nodes on your computing system. Load-sharing facilities can be exploited automatically, for example on a cluster using OpenMosix.
  • nohup make & - to run in background

You may do a partial build of your analysis as follows.
To build all files for tile 12 of lane 3:

make TILE=s_3_0012

To build tiles 5, 10, 15, 20, ... from lanes 3 and 6:

make TILE=s_[36]_[0-9][0-9][0-9][05]

This feature may be useful for a "sneak preview" of your results, after which a full analysis may be built as described above.
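
To check which tiles a TILE pattern would select before starting a build, the pattern can be tried against a list of candidate tile names. The following is a minimal Python sketch, not part of GERALD. It assumes the pattern behaves like a shell-style glob (as its syntax suggests) and that tile names follow the s_<lane>_<four-digit tile> form used above. The helper name matching_tiles, the eight lanes and the 20 tiles per lane are illustrative assumptions.

```python
from fnmatch import fnmatchcase


def matching_tiles(pattern, lanes=range(1, 9), tiles_per_lane=20):
    """Return the tile names s_<lane>_<NNNN> that match a shell-style pattern.

    The lane range and tiles_per_lane are assumptions made for illustration.
    """
    names = [
        f"s_{lane}_{tile:04d}"
        for lane in lanes
        for tile in range(1, tiles_per_lane + 1)
    ]
    return [name for name in names if fnmatchcase(name, pattern)]


# Tiles 5, 10, 15, 20 of lanes 3 and 6, as in the example above.
selected = matching_tiles("s_[36]_[0-9][0-9][0-9][05]")
print(selected)
```

With the assumed layout this lists four tiles from lane 3 followed by four from lane 6, which is a quick way to confirm that a character class such as [05] selects what you intended.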

An example GERALD config file

Here is an example config file that uses most of the new features and documents most of the parameters.

# Example config file for GERALD.pl version 1.15 and above
# Lines preceded by a hash are comments

# notify user at end of analysis run
EMAIL_LIST user@solexa.com
EMAIL_SERVER mailserver
EMAIL_DOMAIN solexa.com

# The email report sent at the end of the analysis run contains hyper-links
# with the following prefix to the run-folder
WEB_DIR_ROOT file://server.solexa.co.uk/share

# Specify read length for experiment; this is the read length used for
# the alignments, not the sequenced read length; consequently, its value
# has to be less than or equal to the sequence length.
READ_LENGTH 25

# Ignore the first base (for example, it could be part of the primer).
# The number of Ys has to match the value of READ_LENGTH - yes, that
# can be awkward to enter manually.
USE_BASES nYYYYYYYYYYYYYYYYYYYYYYYYY

# Specify the genome reference for alignment
GENOME_FILE BAC_plus_vector.fa

# Specify where the genome file can be found
GENOME_DIR /home/user/Genomes

# Align against a genomic sample; i.e., allow alignments to arbitrary
# positions of the provided reference.
ANALYSIS default

# Now some lane-specific options...

# Only align 20 cycles for lane 7
7:READ_LENGTH 20

# Lanes 567 are monotemplates; i.e., the reference contains a set of tags,
# and the alignment software is constrained to match at the first base position
567:ANALYSIS monotemplate
567:USE_BASES all
56:GENOME_FILE hayward12.fa
7:GENOME_FILE magnificent7.fa
# Lane 8 is primers only so switch it off completely
8:ANALYSIS none

# filtering parameters
# (default is '(CHASTITY>=0.6)' and 'PURE_BASES 12')
3:QF_PARAMS (NEIGHBOUR>=5.0)&&(PURITY>=0.7)&&(X_COORD>50)
3:PURE_BASES 1

# Found some bad tiles, these still get aligned but
# get excluded from coverage
BAD_TILES s_1_0001 s_2_0003

Some additional options:

# path to the experiment directory (if not specified on command-line)
EXPT_DIR /data/00002887/Data/C1-36_Firecrest1.8.19_20-06-2006_user/Bustard1.8.19_20-06-2006_user

# Output directory (if other than a new GERALD folder
# inside the EXPT_DIR)
OUT_DIR /home/user/Test4

# Run command below once all 'make all' targets have been built
POST_RUN_COMMAND /yourPath/yourCommand yourArgs

Parameters and points to note

The ANALYSIS variable

You can set a variable ANALYSIS to define what goes on in each lane.

ANALYSIS default

This aligns each read against a reference sequence using the phageAlign program. All the other analysis modes are based on this mode to some extent.

A full description of the files created during an analysis is given here, but briefly:

_align.txt file for each _seq.txt file in lane
_qhg.txt file for each _sig.txt file in lane
_score.txt from each _align.txt - scores for filtered alignments
_realign.txt from each _align.txt and _score.txt - realignment of filtered alignments
_rescore.txt from each _realign.txt
_qalign.txt from each _prb.txt file
_cov.txt from all _realign.txt files - coverage for this lane

ANALYSIS none

If you set e.g.
8:ANALYSIS none
it will ignore lane 8. If there are no files for that lane anyway you don't need to do this.

ANALYSIS eland

Setting e.g.
6:ANALYSIS eland
runs an ELAND whole-genome analysis for lane 6. Documentation can be found in "Whole genome alignments using ELAND". You will need to use ELAND if your reference sequence exceeds, say, 1 Mb in size. No coverage files will be generated, as they are unmanageably large for larger reference sequences.

ANALYSIS monotemplate

6:ANALYSIS monotemplate
This makes lane 6 a monotemplate analysis, where we expect each read to be one of a small number (say 20 or fewer) of known template sequences. The reads are aligned using phageAlign in tag mode, which treats each line of the reference sequence as a separate tag.

No coverage plots will be produced since they are not relevant here. Some monotemplate-specific output is produced instead.

ANALYSIS expression

This mode is meant for experiments that measure gene expression by counting the occurrence of sequence tags. Again this makes use of phageAlign's tag mode to align the reads. It is essentially a cut-down version of monotemplate mode that omits some output files that become unmanageably large when aligning against large numbers of templates.

ANALYSIS sequence

Setting e.g.
6:ANALYSIS sequence
produces a file s_6_sequence.txt. This file contains all sequences in a lane of a chip in a format that is meant to be exportable. The content of this file is affected by the following parameters.

USE_BASES - if USE_BASES all is set, all sequenced bases will show up in this file. Otherwise only cycles that have a Y at the corresponding position in the USE_BASES string will appear.
QF_PARAMS - only sequences for clusters passing the quality filtering will appear. Set QF_PARAMS (1==1) or similar to pass all of them.
SEQUENCE_FORMAT - allowed values here are --fasta, --fastq or --scarf. The fasta and fastq file formats are widely used. Scarf stands for Solexa Compact ASCII Read Format and is a one-line-per-read format used here at Solexa.
QUALITY_FORMAT - allowed values here are --numeric and --symbolic. Mode --numeric gives the quality values as a space separated string of numbers, --symbolic gives them as a compact string of ASCII characters. Subtract 64 from the ASCII code to get the corresponding quality values, but bear in mind that under the Solexa numbering scheme quality values can theoretically be as low as -5, which has an ASCII encoding of 59=';'.
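
The relationship between the two quality formats can be illustrated with a short Python sketch. It follows the rule stated above (ASCII code minus 64) and is not taken from the pipeline itself. The function names are hypothetical, and the sample string is made up for illustration.

```python
def decode_symbolic_qualities(symbols):
    """Convert a --symbolic quality string to integer quality values.

    Each character encodes one quality value as (ASCII code - 64).
    """
    return [ord(ch) - 64 for ch in symbols]


def to_numeric_format(values):
    """Render quality values as a space-separated string, as in --numeric mode."""
    return " ".join(str(q) for q in values)


def to_symbolic_format(values):
    """Render quality values as a compact ASCII string, as in --symbolic mode."""
    return "".join(chr(q + 64) for q in values)


qualities = decode_symbolic_qualities(";@Jh")
print(qualities)                     # [-5, 0, 10, 40]
print(to_numeric_format(qualities))  # -5 0 10 40
```

Note that the lowest value, -5, corresponds to ';', so a parser for the symbolic format should not assume that quality characters start at '@'.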

Note that in this mode the quality values are taken directly from the base caller (from the _prb.txt files via the _qraw.txt files). In default and eland modes a _sequence.txt file is still produced, but its quality values are taken from the recalibrated quality values contained in the _qcal.txt files.

Setting parameters on a lane-by-lane basis

The previous section showed that the ANALYSIS parameter can be set on a lane-by-lane basis. Other parameters can also be set in this manner, and you will need to do this for any parameters specific to a particular lane's analysis.

For a monotemplate lane you need to specify the file of templates, e.g.
6:GENOME_FILE mono1.txt
You can set a parameter for multiple lanes in a single line, e.g.
67:GENOME_FILE mono1.txt
sets it for lanes 6 and 7.
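
To see how global and lane-specific settings combine, the following Python sketch resolves a config into one set of parameters per lane. It is an illustration only, not GERALD's own parser. It assumes eight lanes, the whitespace-separated "NAME value" form used in the example config file, and that a lane-specific setting overrides a global one. The function name parse_config is hypothetical.

```python
import re

LANES = "12345678"  # assumption: eight lanes


def parse_config(text):
    """Resolve GERALD-style config text into {lane: {parameter: value}}."""
    global_params = {}
    lane_params = {lane: {} for lane in LANES}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split(None, 1)
        key = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        match = re.fullmatch(r"([1-8]+):(\w+)", key)
        if match:
            # A prefix such as "567:" applies the setting to lanes 5, 6 and 7.
            for lane in match.group(1):
                lane_params[lane][match.group(2)] = value
        else:
            global_params[key] = value
    # Assumption: lane-specific values take priority over global ones.
    return {lane: {**global_params, **lane_params[lane]} for lane in LANES}


example = """
READ_LENGTH 25
GENOME_FILE BAC_plus_vector.fa
ANALYSIS default
7:READ_LENGTH 20
567:ANALYSIS monotemplate
56:GENOME_FILE hayward12.fa
7:GENOME_FILE magnificent7.fa
8:ANALYSIS none
"""
settings = parse_config(example)
print(settings["7"])
```

Printing the resolved settings for each lane in this way can help spot a lane that has silently inherited a global value you meant to override.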

The other big application of this is to set QF_PARAMS on a lane-by-lane basis.
For example, if clusters near one edge of tile 4 of lane 3 looked dodgy, you could do
3:QF_PARAMS='(NEIGHBOUR>=5.0)&&(PURITY>=0.7)&&((TILE!=4)||(X_COORD>50))'
The filter parameter can be any boolean Perl expression; the variable names are just aliases
for fields in a tab-separated text file.
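
The logic of the filter expression above can be spelled out in Python for readers less familiar with Perl operators. This is only a restatement of that one expression, not the pipeline's filtering code. The cluster records are represented here as dictionaries keyed by the variable names, and the sample values are invented for illustration.

```python
def passes_filter(cluster):
    """Python restatement of
    (NEIGHBOUR>=5.0)&&(PURITY>=0.7)&&((TILE!=4)||(X_COORD>50)).
    """
    return (
        cluster["NEIGHBOUR"] >= 5.0
        and cluster["PURITY"] >= 0.7
        # Clusters on tile 4 must also lie beyond X_COORD 50;
        # clusters on every other tile are unaffected by this term.
        and (cluster["TILE"] != 4 or cluster["X_COORD"] > 50)
    )


# Invented example clusters.
edge_of_tile_4 = {"NEIGHBOUR": 8.0, "PURITY": 0.9, "TILE": 4, "X_COORD": 30}
same_spot_tile_5 = {"NEIGHBOUR": 8.0, "PURITY": 0.9, "TILE": 5, "X_COORD": 30}

print(passes_filter(edge_of_tile_4))    # False
print(passes_filter(same_spot_tile_5))  # True
```

The (TILE!=4)||(X_COORD>50) term is what restricts the extra positional cut to tile 4 alone.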

Or: you could set a shorter READ_LENGTH for an individual lane.

Note that the best way to filter out individual tiles is to set BAD_TILES to be a list of the tiles you want to filter. You can do this retrospectively as described below.

USE_BASES option

"USE_BASES odd" means use bases from odd-numbered cycles only. This is a relic from the days where we had enough time, inclination and disk space to acquire images during both the sequencing and deblocking steps.
"USE_BASES all" means use all bases.
"USE_BASES nnYYnnYYnn" (example) allows you to specify bases to use (Y=use). Some of our sample preps are such that the first base sequenced is actually part of the primer, and the most common application of this is to allow that base to be ignored by setting USE_BASES nYYYYYYY...
Two common pitfalls here:

  • The lower case "n" is important - this tells the pipeline to completely ignore that cycle. An upper case "N" signifies a deblock cycle, causing the pipeline to try (and fail) to produce extra deblock-related output for that cycle.
  • The number of "Y"s in your USE_BASES string needs to be equal to your READ_LENGTH. Therefore if you alter one for a particular lane you'll need to alter the other to match. The pipeline will produce an error message if you get this wrong.
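
Because long USE_BASES strings are awkward to type, it can help to generate the mask and check it against READ_LENGTH before editing the config file. The following Python sketch does this under the rules described above. The function names are hypothetical, and the pipeline performs its own check regardless.

```python
def check_use_bases(use_bases, read_length):
    """Raise ValueError if the number of Ys in the mask differs from READ_LENGTH."""
    if use_bases in ("all", "odd"):
        return  # keyword forms contain no explicit mask to count
    used = use_bases.count("Y")
    if used != read_length:
        raise ValueError(
            f"USE_BASES has {used} Y(s) but READ_LENGTH is {read_length}"
        )


def apply_use_bases(mask, read):
    """Keep only the bases of a read whose mask position is Y."""
    return "".join(base for flag, base in zip(mask, read) if flag == "Y")


# Ignore the first cycle, then use 25 bases (lower-case n, as noted above).
mask = "n" + "Y" * 25
check_use_bases(mask, 25)
print(f"USE_BASES {mask}")

# A short made-up read to show the effect of a mask.
print(apply_use_bases("nYYYY", "TACGT"))  # ACGT
```

Building the mask with string repetition avoids miscounting the Ys, which is the more common of the two pitfalls.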

The --FORCE option

Without the "--FORCE" option, GERALD will not create any directories or files and will operate only in a diagnostic mode, in which parameters are displayed. This option must be specified for GERALD to modify the run-folder structure on the hard drive.

Rerunning the analysis

The config.txt file you use to run GERALD is copied to the output directory. The idea is that if you want to change parameters and rebuild the analysis you can (usually) modify the config file and do

GERALD.pl config.txt --FORCE

If you combine this with the OUT_DIR option, you can force GERALD to overwrite an existing Makefile. This way you can then modify the analysis without directly editing the Makefile.

Contaminant filtering

The contents of a file of contaminant sequences may be filtered out. The pipeline attempts to do this in a rigorous way by comparing the alignment of each read against the data versus its best alignment to the contaminant sequences. A one-sequence-per-line ASCII file is expected, each sequence being at least READ_LENGTH bases in length.

The name of this file should be specified in CONTAM_FILE to switch contaminant filtering on. This file is assumed to live in GENOME_DIR, else you need to specify its location in CONTAM_DIR. More details on the files generated during contaminant filtering may be found here.
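
Before switching contaminant filtering on, the format requirement for the contaminant file (one sequence per line, each at least READ_LENGTH bases long) can be checked with a few lines of Python. This is a sketch under those stated requirements only. The function name is hypothetical, and skipping blank lines is an assumption rather than documented pipeline behaviour.

```python
def short_contaminants(lines, read_length):
    """Return (line_number, sequence) pairs for sequences shorter than read_length."""
    problems = []
    for number, line in enumerate(lines, start=1):
        sequence = line.strip()
        if sequence and len(sequence) < read_length:  # assumption: ignore blank lines
            problems.append((number, sequence))
    return problems


# Made-up contaminant sequences for illustration; the second is too short.
example_lines = [
    "ACGTACGTACGTACGTACGTACGTACGT\n",
    "ACGTACGT\n",
]
print(short_contaminants(example_lines, read_length=25))
```

In practice the lines would come from the file named in CONTAM_FILE, for example by passing an open file object in place of example_lines.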

Setting up email reporting

One of the scripts called by GERALD sends an email report on successful completion of a run. It does this by talking directly to an SMTP server. You will need to specify such a server. See the pipeline installation page for more details.

Getting started with GERALD on Cygwin

This section is specific to Cygwin.

  • To make an appropriate config.txt file, you'll need to modify the various paths so that they point to the mounting of the experiment drives on your PC. This is one example:
    GENOME_FILE phi.fa
    GENOME_DIR /cygdrive/c/User/Genomes

    Note that this is an old phi-X experiment so you'll need to make sure the Phi-X genome file is in GENOME_DIR as phi.fa. Now do

    GERALD.pl config.txt --FORCE

    This should build the new folder with Makefile and config.txt.

  • On my PC some of the programs that build the graphics at the end fail. You will need to have installed
    ImageMagick (cluster has version 5.4.7, but later versions probably not a problem)
    gnuplot (cluster has 3.7 which works fine, but my PC has 4.0 which doesn't like some of the options. I will work on making things compatible)

Other TBDs:

  1. Want a sensible way of producing combined coverage plots for lanes (Do we really need coverage files to be stored at all? Maybe just generate stats and throw files away. A browser like the Ensembl browser or gap4 could build them on the fly as it will need to look directly at the align files anyway to do the stacked read pictures.)
  2. Probably should make the syntax more "XML-like"

Some release notes for pipeline modules

This is not a full version history of every pipeline module. Whenever I get around to it (not very often probably) I will add an entry to the top of this table showing changes to the current "best version" you should be using for the important software modules.

Date | Version | Comments
23.03.06 | GERALD v1.27, runReport.pl v1.3, create_error_thumbnails.pl v1.9, plotScoreHist.pl v1.3, plot_Error_Curves.pl v1.2 | Produces two new thumbnail pages Hist.htm and Perfect.htm - documentation to follow. Also fixed links in runReport.pl
09.03.06 | GERALD v1.26 plus new script runReport v1.2 | Now sends experiment reports via email. Documentation here.
06.03.06 | GERALD v1.23, plus ELAND analysis will require new scripts convertToFasta.pl and convertFromELAND.pl and the latest versions of everything in Eland |
  1. Now produces IVC.htm and All.htm thumbnail pages
  2. Any tiles listed as BAD_TILES in the config file are completely ignored
  3. "ANALYSIS none" now works properly (but note that the syntax for multiple lanes is now "12:ANALYSIS none" - same as for all other variables)
  4. Now supports whole human genome analysis - "ANALYSIS eland". A whole new kettle of worms, documented here.
  5. Tries to remove half-finished files if run crashes or is halted ...
  6. ... and can run "make tidy" to remove any zero size files lying around
20.02.06 | QUAHOG.cpp v1.5, GERALD.pl v1.20, PhageAlign.cpp v1.17 |
  1. Now adds .exe where necessary when run under Cygwin
  2. Now creates error rate thumbnail page
  3. Extra #include added to QUAHOG (prevents compilation problems under Cygwin)
  4. Latest version of PhageAlign - unnecessary memory allocations removed
31.01.06 | QUAHOG.cpp v1.4, GERALD.pl v1.19 | Added PURE_BASES option - purity values can now be calculated over a range of intensities
06.12.05 | PhageAlign.cpp v1.15, GlobalUtilities.cpp v1.12, GERALD.pl v1.17 | None
Document generated by Confluence on Mar 09, 2007 16:11