Pipeline usage
Overview

This document gives a brief overview of how to use the Genome Analyzer data analysis pipeline. For an introduction to the pipeline, see Pipeline documentation. For help with the installation, see Pipeline installation. The analysis pipeline consists of three main modules:

  * Firecrest (image analysis)
  * Bustard (base calling)
  * Gerald (sequence alignment)
For each of these modules, an analysis run creates a new subfolder in the run-folder (see Run-Folder specification). The analysis is run through the "make" utility, which is commonly used for the compilation of software projects. Make serves as a wrapper around the core analysis modules, but all sub-modules can be called directly, and sophisticated users may wish to create their own wrappers. Performance-critical modules are written in C++, most other modules in Perl and Python. The subdirectories and Makefiles for the three core modules are created by three different scripts:

  * goat_pipeline.py (image analysis)
  * bustard.py (base calling)
  * GERALD.pl (alignment)
Each of the first two scripts can invoke the subsequent scripts automatically, so a user never needs to call more than one script for any given analysis run. Typically goat_pipeline.py is the only script that needs to be run, but the other two scripts enable the user to reanalyse data from any given stage in the process, for example using different parameters or a new pipeline version. A run of the pipeline is essentially a two-stage process:

  1. Run one of the scripts above to create the analysis subdirectories and Makefiles.
  2. Start the actual analysis by running "make" in the newly created directory.
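As an illustration, after the first stage the run-folder typically contains one nested analysis subdirectory per module. The sketch below is only an example: the exact folder names carry version, date and user suffixes that depend on your installation, and the nesting shown here (Bustard inside Firecrest, GERALD inside Bustard) is an assumption based on a common layout rather than a guarantee for every pipeline version.

    /data/070813_ILMN-1_0217_FC1234/          (run-folder)
        Data/
            C1-26_Firecrest.../               (image analysis, created by goat_pipeline.py)
                Bustard.../                   (base calling, created by bustard.py)
                    GERALD.../                (alignment, created by GERALD.pl)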
Parameters

For an optimal analysis run, a number of calibration and input parameters are needed by the pipeline. By default, the pipeline auto-calibrates these parameters. However, the user can specify them manually if desired, in most cases in the form of a parameter file. It is useful to understand the meaning of these parameters.
Default offsets are needed for the image analysis, frequency matrix and phasing estimates are needed for the base calling, and the sample information is needed for the alignment. All instrument-specific calibration parameters are estimated automatically during a pipeline run. A more detailed description of the parameters is given in Pipeline calibration parameters.
Directory setup

Default offsets and frequency matrices are usually fairly constant for runs on the same instrument, and the user may not need to change them frequently. The pipeline provides a mechanism for specifying these parameters automatically. Create a directory called Instruments/<instrument_name> for each Genome Analyzer you are analysing data from, in the same directory in which the run-folder is located. <instrument_name> is the hostname of the computer that is attached to the instrument and on which the instrument control software runs. For example, the directory for the run-folder "/data/070813_ILMN-1_0217_FC1234" would be called "/data/Instruments/ILMN-1/". The pipeline will place a file called "default_offsets.txt" into this directory and automatically keeps this file up to date. The default location of the Instruments directory can be overridden by setting the environment variable "INSTRUMENT_DIR", as in

    export INSTRUMENT_DIR=/home/user/Instruments

If no instrument directory exists, the pipeline will auto-create one for you.

Sample information

See GERALD User Guide and FAQ for more information on configuration files for the sequence alignment. In the absence of a mechanism to auto-generate config files from sample-sheet information, this part of the analysis run setup is the most laborious and the easiest to get wrong. Gerald now performs some up-front sanity checking of the parameters.

Basic usage

Goat

A typical analysis run would be started as follows:

    Pipeline/Goat/goat_pipeline.py [--cycles=<firstcycle>-<lastcycle>|auto] [--control-lane=<n>] [--GERALD=<configfile>] [--make] <run-folder> [<run-folder2>]

An example:

    Pipeline/Goat/goat_pipeline.py --cycles=1-27 --GERALD=/data/070813_ILMN-1_0217_FC1234/config.txt /data/070813_ILMN-1_0217_FC1234

would run a sanity check on the run-folder /data/070813_ILMN-1_0217_FC1234, report all detected folders and parameter files, and fill in any missing configuration options. However, calling the script this way does not actually create any Makefiles unless the "--make" option is specified. This is useful for checking that the run-folder is well-formed and that all input files are found by the pipeline. Adding the "--make" option creates the Makefiles:

    Pipeline/Goat/goat_pipeline.py --cycles=1-27 --GERALD=/data/070813_ILMN-1_0217_FC1234/config.txt --make /data/070813_ILMN-1_0217_FC1234

The actual analysis run can then be started by changing into the newly created directory "/data/070813_ILMN-1_0217_FC1234/Data/C1-26_Firecrest..." and starting

    make recursive

or

    make recursive -j n

to parallelise the analysis run across <n> different processes - however, this relies on your operating system and setup to work efficiently and to support automatic load-sharing across multiple CPUs. For more information on pipeline parallelisation see Pipeline parallelisation. The Unix "nohup" command may be useful to redirect the pipeline standard output and keep the "make" process alive even if your terminal is killed:

    nohup make recursive -j n &

More information on further make targets (for example to compress data, or to analyse subsets of the data) and command-line options to the scripts (for example to auto-generate offsets) can be found in Goat documentation - Firecrest, Bustard.
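Putting the steps above together, a complete first-pass analysis might look like the following shell session. This is only a sketch using the example run-folder from this page; the actual Firecrest directory name (hence the wildcard) and the number of parallel processes (here 4) will differ on your system.

    # 1. Dry run: sanity-check the run-folder without creating any Makefiles
    Pipeline/Goat/goat_pipeline.py --cycles=1-27 \
        --GERALD=/data/070813_ILMN-1_0217_FC1234/config.txt \
        /data/070813_ILMN-1_0217_FC1234

    # 2. The same call with --make to create the analysis directories and Makefiles
    Pipeline/Goat/goat_pipeline.py --cycles=1-27 \
        --GERALD=/data/070813_ILMN-1_0217_FC1234/config.txt \
        --make /data/070813_ILMN-1_0217_FC1234

    # 3. Start the analysis in the newly created Firecrest directory, parallelised
    #    over 4 processes; nohup keeps make alive if the terminal is closed and
    #    writes the standard output to nohup.out
    cd /data/070813_ILMN-1_0217_FC1234/Data/C1-26_Firecrest*
    nohup make recursive -j 4 &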
IPAR analysis

With version 1.0 of the pipeline, an additional option is to skip the image analysis and continue the analysis straight from the output of an IPAR module attached to the instrument. A typical IPAR 1.0 folder may be called "/data/070813_ILMN-1_0217_FC1234/Data/IPAR_1.0" (the IPAR version number in the folder name is the version number of the IPAR system that generated the analysis). In order to continue the analysis from the IPAR output, simply specify the IPAR folder as an argument to goat_pipeline.py (instead of the run-folder):

    goat_pipeline.py <IPAR folder>

An example:

    goat_pipeline.py /data/070813_ILMN-1_0217_FC1234/Data/IPAR_1.0

If only the run-folder is specified, the pipeline falls back to its old behaviour and restarts the image analysis.

Bustard

If base calling is to be repeated, for example in order to assign a control lane for more accurate matrix and phasing estimation, the script bustard.py can be called with the Firecrest folder as an argument:

    Pipeline/Goat/bustard.py --control-lane=4 --make /data/070813_ILMN-1_0217_FC1234/Data/C1-26_Firecrest*

Then type

    make recursive

again in the newly generated Bustard folder. For more information on how to run goat_pipeline.py and bustard.py see Goat documentation - Firecrest, Bustard.

Gerald

Gerald can be run without the rest of the pipeline. A typical invocation would be

    Pipeline/Gerald/GERALD.pl gerald_config.txt --EXPT_DIR path_to_bustard_folder --FORCE

For more information on config files and GERALD.pl see GERALD User Guide and FAQ. The default compression format for sig2.txt files is gzip. Therefore, if you need uncompressed sig2.txt files (or another compression format), "--compression <none|gzip|bzip2>" must be added to the command-line arguments. Please also note the difference between the short-read alignment programs that Gerald can use: PhageAlign for exhaustive alignments against short genomes (<1 Mb), and Eland for fast alignment against large genomes with up to two mismatches.
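As a sketch of a typical re-analysis, the two sections above can be combined to repeat the base calling with a control lane and then realign the new base calls. The commands are the ones shown above; <path_to_bustard_folder> stands for the newly generated Bustard folder, and gerald_config.txt is the example config file name from this page, so both will differ for your data.

    # Repeat base calling with lane 4 as the control lane
    Pipeline/Goat/bustard.py --control-lane=4 --make \
        /data/070813_ILMN-1_0217_FC1234/Data/C1-26_Firecrest*

    # Run the base calling in the newly generated Bustard folder
    cd <path_to_bustard_folder>
    make recursive

    # Realign with Gerald, pointing --EXPT_DIR at the Bustard folder
    Pipeline/Gerald/GERALD.pl gerald_config.txt --EXPT_DIR <path_to_bustard_folder> --FORCE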
Read pairs

The pipeline can handle reads generated from paired-end runs and will analyse and align both reads of each pair. There is a GERALD mode called "eland_pair" which allows Eland to align paired reads. The simplest way to use read-pair data assumes that you have a single run-folder containing the images for both reads, with a continuously incremented cycle count. From Galaxy version 1.0 and pipeline version 0.3 on, the pipeline automatically knows where the second read starts. For older versions of the instrument and analysis software, the pipeline option "--new-read-cycle" can be used to identify the start of the second read. An alternative setup stores the two reads of a pair in two separate run-folders. In this case, simply specify both folders as arguments to "goat_pipeline.py". This will generate output only in the first run-folder - the second folder is not touched. For paired reads it may also be useful to know that you can specify multiple "--GERALD" options to goat_pipeline.py simultaneously to have more than one alignment performed with different config files, as in the sketch below.
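The following is a sketch of a paired-end setup with two separate run-folders and two alignments with different config files. The run-folder names and config file names used here are placeholders for illustration only; the output is written to the first run-folder, as described above.

    # Two run-folders (one per read) and two GERALD config files; folder and
    # config names are hypothetical examples
    Pipeline/Goat/goat_pipeline.py --cycles=auto \
        --GERALD=/data/config_pair.txt \
        --GERALD=/data/config_single.txt \
        --make /data/070813_ILMN-1_0217_FC1234_read1 /data/070813_ILMN-1_0217_FC1234_read2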