This page last changed on Apr 03, 2007 by maising.

Overview

This document gives a brief overview of how to use the Solexa data analysis pipeline. For an introduction to the pipeline, see Pipeline documentation. For help on the installation, see Pipeline installation.

The analysis pipeline consists of three main modules:

  1. Image analysis (Firecrest)
  2. Base-caller (Bustard)
  3. Sequence alignment (GERALD/Eland)

For each of these modules, an analysis run will create a new subfolder in the run-folder (see Run-Folder specification). The analysis is run through the "make" utility which is commonly used for the compilation of software projects. Make serves as a wrapper around the core analysis modules, but all sub-modules can be called directly and sophisticated users may wish to create their own wrappers. Performance-critical modules are written in C++, most other modules in Perl and Python.

The subdirectories and Makefiles for the three core modules are created by three different scripts:

  1. Pipeline/Goat/goat_pipeline.py (image analysis)
  2. Pipeline/Goat/bustard.py (base-calling)
  3. Pipeline/Gerald/GERALD.pl (sequence alignment)

Either of the first two scripts can invoke the subsequent scripts automatically, so there is no need to call more than one script for any given analysis run. Typically goat_pipeline.py is the script to run, but the other two scripts let the user reanalyse data at any given stage in the process, for example using different parameters or a new pipeline version.

A run of the pipeline is essentially a two-stage process:

  1. Generate the folders and Makefiles using one of the above scripts.
  2. Start the pipeline by executing "make".

Parameters

For an optimal analysis run, a number of calibration and input parameters are needed by the pipeline. The pipeline can attempt to auto-calibrate most of the calibration parameters. However, the user can manually specify a number of parameters if desired, in most cases in the form of a parameter file. It is useful to understand the meaning of these parameters:

  1. Instrument calibration parameters
    1. Default offsets: There are small pixel offsets between the four images in different colours taken of each tile. These are due to slightly different optical paths for each image. The pipeline also corrects for linear rescalings of the image.
    2. Frequency cross-talk matrix: The Solexa instruments use two different lasers to excite four dyes attached to the four nucleotides. The emission spectra of these four dyes overlap, so the four images are not independent. As in Sanger sequencing, the frequency cross-talk has to be deconvolved using a frequency cross-talk matrix.
    3. Phasing estimates: Depending on the efficiency of the fluidics and the sequencing reactions, a small number of molecules in each cluster may run ahead of or fall behind the current incorporation cycle. These effects, which we call "phasing" (molecules falling behind the incorporation cycle) and "prephasing" (molecules running ahead), can be mitigated by applying a correction during the base-calling step.
  2. Sample information
    1. Reference genome etc: This information is necessary to do the alignment of the sequences to a reference.

Default offsets are needed for the image analysis, the frequency matrix and phasing estimates for the base-calling, and the sample information for the alignment. All instrument-specific calibration parameters are estimated automatically during a pipeline run. However, for the default offsets the estimated values are currently not applied by default; the pipeline has to be told explicitly to auto-generate the values (see usage examples below).

A more detailed description of the parameters is given below.

Parameters and initial setup

As most of the parameters get auto-detected during a run, there is not much need to know about them unless something goes wrong with the run. In most cases, you can probably skip the following sections and move straight on to "Basic usage". However, it is important to note the following points:

  • For the first run of the pipeline, you need to create a folder as described in the next section.
  • For the first run of the pipeline, and after every modification to the optical system (for example after replacement of the filters), you need to use the "--offsets=auto" option for the analysis.
  • For samples that do not contain a mostly random base composition (for example expression data), the auto-calibration of the phasing and matrix is known to fail, and at this point in time you will need a genomic control lane on your chip. You then need to use the "--matrix=auto<n>" and "--phasing=auto<n>" options as described in Goat documentation - Firecrest, Bustard.

Directory setup

Default offsets and frequency matrices are usually fairly constant for runs on the same instrument and the user may not need to change them frequently. The pipeline provides a mechanism for specifying these parameters automatically.

Create a directory called

Instruments/<instrument_name>

for each Solexa instrument you are analysing data from, in the same directory in which the run-folder is located. <instrument_name> is the hostname of the computer that is attached to the instrument and on which the instrument control software runs. For example, the instrument directory for the run-folder "/data/00002887" would be called "/data/Instruments/slxa-a1/". In this directory you need to place a file called "default_offsets.txt" (and optionally another file called "matrix.txt"); the format of these files is explained below. The values found in these files will be used by the pipeline for all analysis runs; the pipeline automatically keeps the file "default_offsets.txt" up-to-date.

The default location of the Instruments directory can be overridden by using the environment variable "INSTRUMENT_DIR", as in

export INSTRUMENT_DIR=/home/user/Instruments

If no instrument directory exists, the pipeline can auto-create one for you (see the "--offsets=auto" option for the pipeline).
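The lookup rule described above can be sketched in a few lines of Python. This is only an illustration, not the pipeline's own code: the function name instrument_dir is hypothetical, and it assumes that INSTRUMENT_DIR replaces the "Instruments" directory itself, with the per-instrument subdirectories still located inside it.

```python
import os
from pathlib import Path


def instrument_dir(run_folder, instrument_name, environ=None):
    """Return the directory expected to hold default_offsets.txt and matrix.txt.

    Assumption: INSTRUMENT_DIR, if set, points at the directory that
    contains one subdirectory per instrument.
    """
    environ = os.environ if environ is None else environ
    override = environ.get("INSTRUMENT_DIR")
    if override:
        base = Path(override)
    else:
        # Default: an "Instruments" directory next to the run-folder.
        base = Path(run_folder).parent / "Instruments"
    return base / instrument_name


# Example from the text: run-folder /data/00002887, instrument host slxa-a1.
print(instrument_dir("/data/00002887", "slxa-a1", environ={}))
```

Passing the environment as an argument keeps the sketch easy to try out without changing the real shell environment.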

Default offsets

A typical file "default_offsets.txt" would look like this:

# Default offsets
 0.00  0.00  0.00000  0.00000
-1.05 -1.62 -0.00017  0.00007
-1.20 -0.47 -0.00143 -0.00142
 0.29 -0.92 -0.00159 -0.00142

It contains four lines with four values each. The first two columns in a row correspond to the values of the x- and y-offsets (in pixels) of the 4 images corresponding to A, C, G, T respectively, with respect to the A image (so the first two values are identical to zero by definition). The next two columns indicate scale factors applied to the image. A scale factor of 0 indicates that the image does not need to be rescaled. A scale factor of 0.001 for a 1000 x 1000 pixel image would indicate that images taken in the corresponding frequency channel tend to be 0.001 x 1000 = 1 pixel larger than the reference channel.
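To make the file layout concrete, here is a minimal Python sketch that reads the example file shown above and reproduces the scale-factor arithmetic. The function names are hypothetical and the parser is not part of the pipeline; it assumes whitespace-separated values, "#" comment lines, and rows in the order A, C, G, T as described.

```python
SAMPLE = """# Default offsets
 0.00  0.00  0.00000  0.00000
-1.05 -1.62 -0.00017  0.00007
-1.20 -0.47 -0.00143 -0.00142
 0.29 -0.92 -0.00159 -0.00142
"""


def parse_default_offsets(text):
    """Map each base (A, C, G, T) to its pixel offset and scale factors."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        values = [float(v) for v in line.split()]
        if len(values) != 4:
            raise ValueError("expected four values per row: %r" % line)
        rows.append(values)
    if len(rows) != 4:
        raise ValueError("expected four rows, found %d" % len(rows))
    return {
        base: {"offset": (row[0], row[1]), "scales": (row[2], row[3])}
        for base, row in zip("ACGT", rows)
    }


def size_difference_pixels(scale_factor, image_size):
    """Size difference relative to the reference channel, in pixels."""
    return scale_factor * image_size


offsets = parse_default_offsets(SAMPLE)
print(offsets["C"]["offset"])                # x- and y-offset of the C image
print(size_difference_pixels(0.001, 1000))   # the worked example from the text
```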

Each analysis run creates a file called "Data/default_offsets.txt" in the current run-folder, which is then used for any subsequent analysis of the same run-folder; so any second run of the image analysis will automatically use the correct parameters. If a file "Instruments/<instrument>/default_offsets.txt" (or in a different location if the environment variable INSTRUMENT_DIR is defined) exists, its values will be used for the first analysis run for the run-folder, and its values will be updated during the first run only.

Frequency cross-talk matrix

The frequency cross-talk matrix would be specified as follows:

# frequency response matrix definition
> C
> A
> T
> G
1.18 1.29 0.00 0.00
0.18 1.03 0.00 0.00
0.00 0.00 1.43 0.80
0.00 0.00 0.00 0.71

The lines starting with ">" specify the order of the rows and columns (in terms of the bases they correspond to). If a file "Instruments/<instrument>/matrix.txt" exists, its values will be used if the "--matrix=default" option has been specified to the pipeline (see below).
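As an illustration of this layout, the following Python sketch reads the example matrix and looks up entries by base label rather than by position. The function names are hypothetical, and the sketch only parses the file; it makes no claim about how the pipeline itself applies the matrix.

```python
SAMPLE = """# frequency response matrix definition
> C
> A
> T
> G
1.18 1.29 0.00 0.00
0.18 1.03 0.00 0.00
0.00 0.00 1.43 0.80
0.00 0.00 0.00 0.71
"""


def parse_matrix(text):
    """Return (labels, rows): the base order and the square matrix."""
    labels, rows = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith(">"):
            labels.append(line[1:].strip())  # base label for a row/column
        else:
            rows.append([float(v) for v in line.split()])
    n = len(labels)
    if len(rows) != n or any(len(row) != n for row in rows):
        raise ValueError("matrix shape does not match the number of labels")
    return labels, rows


def entry(labels, rows, row_base, col_base):
    """Look up one matrix entry by the bases naming its row and column."""
    return rows[labels.index(row_base)][labels.index(col_base)]


labels, rows = parse_matrix(SAMPLE)
print(labels)                         # base order taken from the ">" lines
print(entry(labels, rows, "C", "A"))  # first row, second column
```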

The frequency cross-talk matrix gets estimated during the analysis run; its estimate can be found in the file "C..._Firecrest.../Matrix/s_matrix.txt".

Phasing

The phasing estimates are produced before a run of the base-caller. The file "C..._Firecrest.../Bustard.../Phasing/phasing.txt" will contain, for each tile analysed during the analysis, an estimate of the phasing and prephasing rate. The first four columns in this file indicate: lane, tile, phasing estimate, prephasing estimate. These values can be applied by the user to a subsequent run of the base-caller. As the estimation procedure uses statistical averaging over many clusters and sequences to estimate the correlation of signal between different cycles, the phasing estimates tend to be more accurate for tiles with larger numbers of clusters and a mixture of different sequences. Samples containing only a small number of different sequences do not produce reliable estimates.
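One way to turn the per-tile estimates into values for a subsequent base-calling run is to average them per lane. The Python sketch below shows the idea under stated assumptions: the columns are whitespace-separated, lane and tile are integers, and a plain unweighted mean is good enough. None of this is confirmed by the pipeline documentation, and the function names are hypothetical.

```python
from collections import defaultdict


def parse_phasing(text):
    """Return a list of (lane, tile, phasing, prephasing) tuples."""
    records = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[0].startswith("#"):
            continue  # skip blank, short or comment lines
        lane, tile = int(fields[0]), int(fields[1])
        phasing, prephasing = float(fields[2]), float(fields[3])
        records.append((lane, tile, phasing, prephasing))
    return records


def mean_rates_per_lane(records):
    """Average the phasing and prephasing estimates over the tiles of each lane."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for lane, _tile, phasing, prephasing in records:
        acc = sums[lane]
        acc[0] += phasing
        acc[1] += prephasing
        acc[2] += 1
    return {lane: (acc[0] / acc[2], acc[1] / acc[2]) for lane, acc in sums.items()}
```

Because tiles with few clusters give less reliable estimates, a weighted mean or a median may be a better choice in practice; the unweighted mean is used here only to keep the sketch short.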

Sample information

See GERALD User Guide and FAQ for more information on configuration files for the sequence alignment. In the absence of a mechanism to auto-generate config files from sample-sheet information, this part of the analysis run setup is the most laborious, and the easiest to get wrong. Gerald now performs some up-front checking of the sanity of the parameters.

Basic usage

Goat

A typical analysis run would be started as follows:

Pipeline/Goat/goat_pipeline.py [--cycles=<firstcycle>-<lastcycle>|auto] [--offsets=<offsetfile>|auto]
   [--matrix=<matrixfile>|auto|auto<n>] [--phasing=<phasingratio>|auto|auto<n>] [--prephasing=<prephasingratio>]
   [--GERALD=<configfile>] [--make] <run-folder> [<run-folder2>]

An example:

Pipeline/Goat/goat_pipeline.py --cycles=1-27 --GERALD=/data/00002887/config.txt /data/00002887

would run a sanity check on the run-folder /data/00002887 and report all detected folders and parameter files, but it would not actually create any Makefiles unless the "--make" option is specified. This is useful for checking that the run-folder is well-formed and all parameter files are found by the pipeline.

Adding the "--make" option to the command would create an analysis for the run-folder /data/00002887, using the instrument default offsets, and auto-generated matrix and phasing values. If the "--GERALD" option is specified, Gerald will also start a brief "make" run to test the consistency of the supplied config file. Example:

Pipeline/Goat/goat_pipeline.py --cycles=1-27 --GERALD=/data/00002887/config.txt --make /data/00002887
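For users who wrap the pipeline in their own scripts, the command line above can be assembled programmatically. The Python sketch below is a hypothetical helper, not part of the pipeline; it uses only the options listed in the synopsis above and assumes the script is reachable under the relative path shown.

```python
def build_goat_command(run_folders, cycles=None, offsets=None, matrix=None,
                       phasing=None, prephasing=None, gerald_configs=(),
                       make=False, script="Pipeline/Goat/goat_pipeline.py"):
    """Assemble the argument list for a goat_pipeline.py invocation."""
    cmd = [script]
    options = (
        ("--cycles", cycles),
        ("--offsets", offsets),
        ("--matrix", matrix),
        ("--phasing", phasing),
        ("--prephasing", prephasing),
    )
    for flag, value in options:
        if value is not None:
            cmd.append("%s=%s" % (flag, value))
    for config in gerald_configs:
        cmd.append("--GERALD=%s" % config)
    if make:
        cmd.append("--make")  # without this only the sanity check is run
    cmd.extend(run_folders)
    return cmd


# Reproduces the example above as a list suitable for subprocess.run().
print(build_goat_command(["/data/00002887"], cycles="1-27",
                         gerald_configs=["/data/00002887/config.txt"],
                         make=True))
```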

The actual analysis run can then be started by changing into the newly created directory "/data/00002887/Data/C1-26_Firecrest..." and starting

make recursive

or

make recursive -j n

to parallelise the analysis run to <n> different processes - however, this relies on your operating system or setup to work efficiently and support automatic load-sharing to multiple CPUs. For more information on pipeline parallelisation see Pipeline parallelisation. The Unix "nohup" command may be useful to redirect the pipeline standard output and keep the "make" process alive even if your terminal is killed:

nohup make recursive -j n &
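The second stage can also be driven from a script. The Python sketch below locates the analysis directories inside a run-folder and builds the "make recursive" command. The helper names are hypothetical, and the glob pattern "C*_Firecrest*" is an assumption based on the directory names quoted in this document.

```python
from pathlib import Path


def firecrest_dirs(run_folder):
    """List analysis directories under <run-folder>/Data, sorted by name."""
    data_dir = Path(run_folder) / "Data"
    return sorted(p for p in data_dir.glob("C*_Firecrest*") if p.is_dir())


def make_command(jobs=None):
    """Build the make invocation, optionally with a number of parallel jobs."""
    cmd = ["make", "recursive"]
    if jobs is not None:
        cmd += ["-j", str(jobs)]
    return cmd


# Typical use (not executed here):
#   import subprocess
#   analysis_dir = firecrest_dirs("/data/00002887")[-1]
#   subprocess.run(make_command(jobs=4), cwd=analysis_dir, check=True)
print(make_command(jobs=4))
```

If a run-folder has been analysed more than once, several directories will match, so the caller has to decide which one to use.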

More information on further make targets (for example to compress data, or to analyse subsets of the data) and command-line options to scripts (for example to auto-generate offsets) can be found in Goat documentation - Firecrest, Bustard.

Problem reports

In case of any problems with the analysis pipeline, it usually helps to capture all the output and error messages produced by a run. This can be done by simply redirecting the output, using nohup or the facilities of a cluster management system.
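If the run is started from a script rather than a shell, the same capture can be done in Python. This is a minimal sketch with a hypothetical helper name; it merges standard error into standard output so that messages stay in their original order, and it holds the whole output in memory, which may not suit very long runs.

```python
import subprocess


def run_and_capture(cmd, log_path=None, cwd=None):
    """Run cmd, merge stderr into stdout and optionally save the text to a log file."""
    result = subprocess.run(
        cmd,
        cwd=cwd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # keep error messages with normal output
        text=True,
    )
    if log_path is not None:
        with open(log_path, "w") as log:
            log.write(result.stdout)
    return result.returncode, result.stdout


# Typical use (not executed here):
#   code, output = run_and_capture(["make", "recursive"],
#                                  log_path="make.log", cwd=analysis_dir)
```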

The appropriate forum to post problem reports to is the email address "analysis@illumina.com".

If there are Gerald related issues, it usually helps to post the config.txt file found in the GERALD output folder. For problems relating to specific tiles/files it may also be useful to send the output of "wc -l" and "ls -l" on these files.

Bustard

If base-calling is to be repeated using a newly adapted matrix or manually chosen phasing estimates, change into the "C..._Firecrest..." subfolder and run, for example,

Pipeline/Goat/bustard.py --matrix=path/matrix.txt --phasing=0.0075 --prephasing=0.01 --make .

and then run again

make recursive

For more information on how to run goat_pipeline.py and bustard.py see Goat documentation - Firecrest, Bustard.

Gerald

Gerald can be run without the rest of the pipeline. A typical invocation would be

Pipeline/Gerald/GERALD.pl gerald_config.txt --EXPT_DIR path_to_bustard_folder --FORCE

For more information on config files and GERALD.pl see GERALD User Guide and FAQ. Please also note the difference between the possible short-read alignment programs that Gerald can use: PhageAlign for exhaustive alignments of short genomes (<1 Mb), and Eland for a fast alignment to large genomes with up to two mismatches.

Read pairs

Read pairs are supported up to the base-calling stage. The simplest way to use read-pair data assumes that you have two separate run-folders, one for each read. Then you simply specify both folders as arguments to "goat_pipeline.py". This will

  • Generate output only in the first run-folder - the second folder is not touched.
  • Remap all clusters across both runs to the first cycle of the first run-folder.
  • Reset the phasing correction after the end of the first read.
  • Generate sequence strings whose combined length is equal to the sum of the number of cycles found in both run-folders.

An alternative way is to save all images in a single run-folder with a continuously incremented cycle count. In this case, the option "--reset-phasing-cycle" can be used to reset the phasing correction in the cycle corresponding to the first base of the second read.
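Since the two reads come out as one combined sequence string, aligning them separately requires splitting that string first. The Python sketch below shows this under one assumption that the text above does not state explicitly: the bases of the first read come first, followed by the bases of the second read. The function name is hypothetical.

```python
def split_read_pair(sequence, first_read_cycles):
    """Split a combined sequence string into (first_read, second_read).

    Assumption: the first `first_read_cycles` bases belong to the first read.
    """
    if not 0 < first_read_cycles < len(sequence):
        raise ValueError("first_read_cycles must lie inside the sequence")
    return sequence[:first_read_cycles], sequence[first_read_cycles:]


# A 10-base combined string from a run with 6 cycles in the first read.
print(split_read_pair("ACGTACGTTT", 6))
```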

There is currently no direct support for read pairs at the alignment stage; this is anticipated to be part of an upcoming release. You can, however, align both reads separately. In this context it may be useful to know that you can specify multiple "--GERALD" options to goat_pipeline.py simultaneously to have more than one alignment performed with different config files.
