This page last changed on Aug 15, 2007 by maising.

Overview

This document gives a brief overview of how to use the Genome Analyzer data analysis pipeline. For an introduction to the pipeline, see Pipeline documentation. For help on the installation, see Pipeline installation.

The analysis pipeline consists of three main modules:

  1. Image analysis (Firecrest)
  2. Base-caller (Bustard)
  3. Sequence alignment (GERALD/Eland)

For each of these modules, an analysis run will create a new subfolder in the run-folder (see Run-Folder specification). The analysis is run through the "make" utility, which is commonly used for the compilation of software projects. Make serves as a wrapper around the core analysis modules, but all sub-modules can be called directly, and sophisticated users may wish to create their own wrappers. Performance-critical modules are written in C++, and most other modules in Perl and Python.

The subdirectories and Makefiles for the three core modules are created by three different scripts:

  1. Pipeline/Goat/goat_pipeline.py (image analysis)
  2. Pipeline/Goat/bustard.py (base-calling)
  3. Pipeline/Gerald/GERALD.pl (sequence alignment)

Either of the first two scripts can invoke the subsequent scripts automatically, so there is no need for a user to call more than one script for any given analysis run. Typically goat_pipeline.py is the script that needs to be run, but the other two scripts enable the user to reanalyse data from any given stage in the process, for example using different parameters or a new pipeline version.

A run of the pipeline is essentially a two-stage process:

  1. Generate the folders and Makefiles using one of the above scripts.
  2. Start the pipeline by executing "make".

Parameters

For an optimal analysis run, the pipeline needs a number of calibration and input parameters. By default, the pipeline estimates the calibration parameters automatically. However, if desired, the user can specify them manually, in most cases in the form of a parameter file. It is useful to understand the meaning of these parameters:

  1. Instrument calibration parameters
    1. Default offsets: There are small pixel offsets between the four images, one per colour, taken of each tile. These are due to slightly different optical paths for each image. The pipeline also corrects for linear rescalings of the image.
    2. Frequency cross-talk matrix: The Genome Analyzer uses two different lasers to excite four dyes attached to the four nucleotides. The emission spectra of these four dyes overlap, so the four images are not independent. As in Sanger sequencing, the frequency cross-talk has to be deconvolved using a frequency cross-talk matrix.
    3. Phasing estimates: Depending on the efficiency of the fluidics and the sequencing reactions, a small number of molecules in each cluster may run ahead of or fall behind the current incorporation cycle. This effect, which we call "phasing" (molecules falling behind the incorporation cycle) and "prephasing" (molecules running ahead), can be mitigated by applying a correction during the base-calling step.
  2. Sample information
    1. Reference genome etc: This information is necessary to do the alignment of the sequences to a reference.
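
To make the cross-talk deconvolution more concrete, the following is a minimal Python sketch, not the pipeline's actual implementation. It assumes the observed channel intensities are a linear mix of the true dye signals, so the correction amounts to solving a 4x4 linear system. The matrix values, the channel order (A, C, G, T) and the function names are all made up for illustration.

```python
def solve_linear(matrix, rhs):
    """Solve matrix * x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    # Work on an augmented copy so the inputs are left untouched.
    aug = [list(map(float, row)) + [float(b)] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("cross-talk matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(n):
            if row != col:
                factor = aug[row][col] / aug[col][col]
                aug[row] = [a - factor * b for a, b in zip(aug[row], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


# Hypothetical cross-talk matrix (values are invented, not measured):
# column j holds the response of the four channels to a unit amount of dye j.
CROSSTALK = [
    [1.0, 0.6, 0.0, 0.0],
    [0.3, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.5],
    [0.0, 0.0, 0.2, 1.0],
]


def deconvolve(observed):
    """Recover per-dye signals from the four observed channel intensities."""
    return solve_linear(CROSSTALK, observed)


def call_base(observed):
    """Pick the base whose corrected signal is strongest (assumed order A, C, G, T)."""
    corrected = deconvolve(observed)
    return "ACGT"[corrected.index(max(corrected))]
```

With these invented numbers, the raw intensities (70, 62, 0, 0) are brightest in the first channel, but the corrected signals favour the second dye, which illustrates why the deconvolution has to happen before a base is called.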

Default offsets are needed for the image analysis; the frequency matrix and phasing estimates are needed for the base-calling; and the sample information is needed for the alignment. All instrument-specific calibration parameters are estimated automatically during a pipeline run.

A more detailed description of the parameters is given in Pipeline calibration parameters.

Parameters and initial setup

As most of the parameters get auto-detected during a run, there is not much need to know about them unless something goes wrong with the run. In most cases, you can probably skip the following sections and move straight on to "Basic usage". However, it is important to note the following points:

  • For the first run of the pipeline, create an Instruments folder as described in the next section. (If you forget to create the folder, the pipeline will create one automatically.)
  • For samples that do not contain a mostly random base composition (for example expression data), the auto-calibration of the phasing and matrix is not reliable. At the moment this means that you will need to dedicate a lane on your flowcell to running a genomic control sample. Then use the "--control-lane=<n>" option as described in Goat documentation - Firecrest, Bustard.

Directory setup

Default offsets and frequency matrices are usually fairly constant for runs on the same instrument and the user may not need to change them frequently. The pipeline provides a mechanism for specifying these parameters automatically.

Create a directory called

Instruments/<instrument_name>

for each Genome Analyzer you are analysing data from, in the same directory in which the run-folder is located. <instrument_name> is the hostname of the computer that is attached to the instrument and on which the instrument control software runs. For example, the directory for the run-folder "/data/070813_ILMN-1_0217_FC1234" would be called "/data/Instruments/ILMN-1/". The pipeline will place a file called "default_offsets.txt" into this directory and automatically keep it up-to-date.

The default location of the Instruments directory can be overridden by using the environment variable "INSTRUMENT_DIR", as in

export INSTRUMENT_DIR=/home/user/Instruments

If no instrument directory exists, the pipeline will auto-create one for you.
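
The directory convention above can be sketched in Python as follows. This is an illustration only, not pipeline code. It assumes that the instrument name is the second underscore-separated field of the run-folder name (inferred from the "ILMN-1" example) and that "INSTRUMENT_DIR" replaces the "Instruments" directory, with the per-instrument subdirectory still appended. The function name is hypothetical.

```python
import os
import posixpath


def instrument_dir(run_folder, environ=None):
    """Return the expected per-instrument directory for a run-folder.

    Assumptions (not confirmed by the pipeline documentation):
    - the instrument name is the second "_"-separated field of the folder name;
    - INSTRUMENT_DIR, if set, replaces "<parent>/Instruments".
    """
    environ = os.environ if environ is None else environ
    parent, name = posixpath.split(posixpath.normpath(run_folder))
    fields = name.split("_")
    if len(fields) < 2:
        raise ValueError("unexpected run-folder name: %r" % name)
    instrument = fields[1]
    base = environ.get("INSTRUMENT_DIR") or posixpath.join(parent, "Instruments")
    return posixpath.join(base, instrument)
```

For the run-folder used as an example above, this returns "/data/Instruments/ILMN-1" when "INSTRUMENT_DIR" is not set.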

Sample information

See GERALD User Guide and FAQ for more information on configuration files for the sequence alignment. In the absence of a mechanism to auto-generate config files from sample-sheet information, this part of the analysis run setup is the most laborious, and the easiest to get wrong. Gerald now performs some up-front checking of the sanity of the parameters.

Basic usage

Goat

A typical analysis run would be started as follows:

Pipeline/Goat/goat_pipeline.py [--cycles=<firstcycle>-<lastcycle>|auto] [--control-lane=<n>]
   [--GERALD=<configfile>] [--make] <run-folder> [<run-folder2>]

An example:

Pipeline/Goat/goat_pipeline.py --cycles=1-27 --GERALD=/data/070813_ILMN-1_0217_FC1234/config.txt /data/070813_ILMN-1_0217_FC1234

would run a sanity check on the run-folder /data/070813_ILMN-1_0217_FC1234, report all detected folders and parameter files, and fill in any missing configuration options. However, calling the script does not actually create any Makefiles unless the "--make" option is specified. This is useful for checking that the run-folder is well-formed and that all input files are found by the pipeline.

Adding the "--make" option to the command would create an analysis for the run-folder /data/070813_ILMN-1_0217_FC1234. If the "--GERALD" option is specified, Gerald will also start a brief "make" run to test the consistency of the supplied config file. Example:

Pipeline/Goat/goat_pipeline.py --cycles=1-27 --GERALD=/data/070813_ILMN-1_0217_FC1234/config.txt --make /data/070813_ILMN-1_0217_FC1234
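
If the command line is assembled from a script, a small helper along the following lines may be convenient. It is a hedged sketch: the helper name and its argument names are hypothetical, and it only emits the options shown in the synopsis above ("--cycles", "--control-lane", "--GERALD", "--make").

```python
def goat_command(run_folders, cycles=None, control_lane=None,
                 gerald_configs=(), make=False,
                 script="Pipeline/Goat/goat_pipeline.py"):
    """Build the argument list for goat_pipeline.py (illustrative helper).

    cycles is either the string "auto" or a (first, last) tuple.
    """
    run_folders = list(run_folders)
    if not 1 <= len(run_folders) <= 2:
        raise ValueError("expected one or two run-folders")
    cmd = [script]
    if cycles == "auto":
        cmd.append("--cycles=auto")
    elif cycles is not None:
        first, last = cycles
        cmd.append("--cycles=%d-%d" % (first, last))
    if control_lane is not None:
        cmd.append("--control-lane=%d" % control_lane)
    for config in gerald_configs:
        cmd.append("--GERALD=%s" % config)
    if make:
        # Without --make the script only performs the sanity check.
        cmd.append("--make")
    return cmd + run_folders
```

The returned list can be passed to subprocess.run() unchanged; leaving out make=True reproduces the check-only invocation.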

The actual analysis run can then be started by changing into the newly created directory "/data/070813_ILMN-1_0217_FC1234/Data/C1-26_Firecrest..." and starting

make recursive

or

make recursive -j n

to parallelise the analysis run across <n> processes. However, this relies on your operating system or setup to work efficiently and to support automatic load-sharing across multiple CPUs. For more information on pipeline parallelisation see Pipeline parallelisation. The Unix "nohup" command may be useful to redirect the pipeline standard output and keep the "make" process alive even if your terminal is killed:

nohup make recursive -j n &
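
The same steps can be driven from Python. The sketch below is only one possible wrapper, with several assumptions: the analysis directory is located by globbing for "C*_Firecrest*" under "Data" (a pattern inferred from the example path above), the log file name "make.log" is arbitrary, and start_new_session=True is used as a rough stand-in for "nohup" on Unix. All function names are hypothetical.

```python
import glob
import os
import subprocess


def newest_analysis_dir(run_folder, pattern="C*_Firecrest*"):
    """Return the most recently modified Firecrest folder under <run_folder>/Data."""
    candidates = [path
                  for path in glob.glob(os.path.join(run_folder, "Data", pattern))
                  if os.path.isdir(path)]
    if not candidates:
        raise FileNotFoundError("no analysis folder found in %s" % run_folder)
    return max(candidates, key=os.path.getmtime)


def make_command(jobs=1):
    """Build the 'make recursive' command, adding '-j <jobs>' for parallel runs."""
    cmd = ["make", "recursive"]
    if jobs > 1:
        cmd += ["-j", str(jobs)]
    return cmd


def start_make(analysis_dir, jobs=1, log_name="make.log"):
    """Start make in the background, appending all output to a log file."""
    with open(os.path.join(analysis_dir, log_name), "ab") as log:
        # The child keeps its own copy of the log file descriptor.
        return subprocess.Popen(make_command(jobs), cwd=analysis_dir,
                                stdin=subprocess.DEVNULL, stdout=log,
                                stderr=subprocess.STDOUT,
                                start_new_session=True)
```

A call such as start_make(newest_analysis_dir("/data/070813_ILMN-1_0217_FC1234"), jobs=4) would then correspond to running "make recursive -j 4" in the background with its output captured.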

More information on further make targets (for example to compress data, or to analyse subsets of the data) and command-line options to scripts (for example to auto-generate offsets) can be found in Goat documentation - Firecrest, Bustard.

Bustard

If base-calling is to be repeated, for example in order to assign a control lane for more accurate matrix and phasing estimation, a script called bustard.py can be called taking the Firecrest folder as an argument:

Pipeline/Goat/bustard.py --control-lane=4 --make /data/070813_ILMN-1_0217_FC1234/Data/C1-26_Firecrest*

and then run again

make recursive

in the newly generated Bustard folder.

For more information on how to run goat_pipeline.py and bustard.py see Goat documentation - Firecrest, Bustard.

Gerald

Gerald can be run without the rest of the pipeline. A typical invocation would be

Pipeline/Gerald/GERALD.pl gerald_config.txt --EXPT_DIR path_to_bustard_folder --FORCE

For more information on config files and GERALD.pl see GERALD User Guide and FAQ. Please also note the difference between the possible short-read alignment programs that Gerald can use: PhageAlign for exhaustive alignments of short genomes (<1 Mb), and Eland for a fast alignment to large genomes with up to two mismatches.
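
To illustrate what "up to two mismatches" means for a short read, here is a naive Python sketch. It is a brute-force scan written purely for illustration and makes no claim about how Eland or PhageAlign work internally. The function names are hypothetical.

```python
def count_mismatches(read, reference_window):
    """Count positions at which two equal-length sequences differ."""
    if len(read) != len(reference_window):
        raise ValueError("sequences must have the same length")
    return sum(1 for a, b in zip(read, reference_window) if a != b)


def find_hits(read, reference, max_mismatches=2):
    """Return (position, mismatches) for every window within the mismatch limit.

    This scans every position of the reference, which is only practical
    for very small references.
    """
    hits = []
    for pos in range(len(reference) - len(read) + 1):
        mismatches = count_mismatches(read, reference[pos:pos + len(read)])
        if mismatches <= max_mismatches:
            hits.append((pos, mismatches))
    return hits
```

A scan like this visits every position of the reference, which hints at why an exhaustive approach is reserved for short genomes.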

Read pairs

The pipeline can handle reads generated from paired end runs. As part of the analysis, it will:

  • Remap all clusters across both runs to the first cycle of the first read.
  • Reset the matrix and matrix calculation after the end of the first read.
  • Generate sequence strings whose combined length is equal to the sum of the lengths of each individual read.
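
When post-processing such combined sequence strings, it may be necessary to split them back into the two reads. A minimal sketch, assuming the first read comes first in the string and its length is known (the function and parameter names are hypothetical):

```python
def split_read_pair(combined, read1_length):
    """Split a combined sequence string into (read1, read2).

    Assumes read 1 occupies the first read1_length characters and
    read 2 the remainder.
    """
    if not 0 < read1_length < len(combined):
        raise ValueError("read1_length must lie strictly inside the sequence")
    return combined[:read1_length], combined[read1_length:]
```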

There is a GERALD mode called "eland_pair" which will allow Eland to align paired reads.

The simplest way to use read-pair data assumes that you have a single run-folder containing the images for both reads, with a continuously incremented cycle count. From Galaxy version 1.0 and pipeline version 0.3 on, the pipeline will automatically know where the second read starts. For older versions of the instrument and analysis software, the pipeline option "--new-read-cycle" can be used to identify the start of the second read.

An alternative way assumes that both reads of a pair are stored in two separate run folders. Then you simply specify both folders as arguments to "goat_pipeline.py". This will generate output only in the first run-folder - the second folder is not touched.

For paired reads it may also be useful to know that you can specify multiple "--GERALD" options to goat_pipeline.py simultaneously to have more than one alignment performed with different config files.

Reporting problems

In case of any problems with the analysis pipeline, it usually helps to capture all the output and error messages produced by a run. This can be done by simply redirecting the output, using nohup or the facilities of a cluster management system.

The appropriate forum to post problem reports to is the email address "techsupport@illumina.com".

It usually helps to attach the Makefile corresponding to the part of the pipeline that was believed to be causing the problems. If there are Gerald related issues, it usually helps to post the config.txt file found in the GERALD output folder. For problems relating to specific tiles/files it may also be useful to send the output of "wc -l" and "ls -l" on these files.
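
Where "wc -l" and "ls -l" are not convenient, roughly equivalent information can be gathered with a few lines of Python. This is an illustrative helper, not part of the pipeline; the function name and output format are arbitrary.

```python
import os


def file_report(paths):
    """Return one summary line per file: path, newline count and size in bytes."""
    lines = []
    for path in paths:
        size = os.path.getsize(path)
        newline_count = 0
        with open(path, "rb") as handle:
            # Read in 1 MiB chunks so large files do not need to fit in memory.
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                newline_count += chunk.count(b"\n")
        lines.append("%s\t%d lines\t%d bytes" % (path, newline_count, size))
    return lines
```

The resulting lines can be pasted into a problem report alongside the Makefile and config.txt.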

Document generated by Confluence on Jan 11, 2008 15:41