This page last changed on Mar 28, 2008 by maising.
This document describes the basic usage of the "GOAT pipeline", a part of the data analysis pipeline that is a wrapper for the Firecrest image analysis software and the Bustard base-caller.
Although several different software programs are involved in an analysis run, a single command can be used to start the whole pipeline automatically, from the images saved by the instruments up to base-called sequences and probability estimates.
The pipeline operates in a specific directory structure that the images and analysis output files are saved in. This run-folder specification is described in detail in the document Run-Folder Specification. Output files produced by the analysis software are saved in a hierarchical directory structure. Each run of one of the main analysis modules creates a subdirectory in the run-folder. For example, each run of the image analysis software will create a new image analysis output folder. Each run of the base-calling software will create a new subdirectory in the image analysis subdirectory that the base calls are based on, resulting in a tree-like structure of analyses. Parameters and versions for any given analysis run are logged in the folder structure in order to make it possible to reconstruct any previous analysis run.
The modular structure should make it easy to replace specific modules of the pipeline with alternative or more recent implementations. The whole analysis is centred around dependencies between input and output files (for example, in order to produce a sequence file, an intensity file with the output of the image analysis is required) and is based on the Unix "make" utility, which is designed to model dependency trees by specifying dependency rules for files. The actual implementation uses make extensions that may only be available in the GNU version of make. The use of make also makes parallelisation easy because the parallel forking of processes described by make rules is one of the features of many make implementations.
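The dependency rule that make applies can be illustrated with a toy check. The following is a minimal Python sketch, not the pipeline's own code: a target is rebuilt when it is missing or older than any of its prerequisites. The file names are hypothetical placeholders for an intensity file and the sequence file derived from it.

```python
import os
import tempfile


def needs_rebuild(target, prerequisites):
    """Return True if target is missing or older than any prerequisite."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(p) > target_mtime for p in prerequisites)


def demo():
    """Walk through three states of a two-file dependency chain."""
    results = []
    with tempfile.TemporaryDirectory() as workdir:
        # Hypothetical file names, chosen only for illustration.
        intensity = os.path.join(workdir, "intensity.txt")
        sequence = os.path.join(workdir, "sequence.txt")

        open(intensity, "w").close()
        os.utime(intensity, (1000, 1000))
        # 1. The sequence file does not exist yet, so it must be built.
        results.append(needs_rebuild(sequence, [intensity]))

        open(sequence, "w").close()
        os.utime(sequence, (2000, 2000))
        # 2. The sequence file is newer than its input: nothing to do.
        results.append(needs_rebuild(sequence, [intensity]))

        os.utime(intensity, (3000, 3000))
        # 3. The input changed after the last build: rebuild.
        results.append(needs_rebuild(sequence, [intensity]))
    return results


if __name__ == "__main__":
    print(demo())
```

Because only stale targets are rebuilt, rerunning make after an interruption continues with the work that is still outstanding.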
The installation of the pipeline is described in Pipeline installation.
An overview of the pipeline usage is given in Pipeline usage.
Running the pipeline
The standard pipeline invocation assumes that the images are organised in a certain directory structure described by the Run-folder specification. Unlike the stand-alone Firecrest image analysis package, the rest of the sequencing pipeline does not have a graphical user interface (GUI). The general synopsis of a pipeline call is
/path/Pipeline/Goat/goat_pipeline.py [--cycles=1-25|auto] [--tiles=s_1,s_2_0003,...]
[--matrix=mymatrix.txt|auto|auto<n>] [--offsets=/path/default_offsets.txt|auto]
[--phasing=0.01|auto|auto<n>] [--prephasing=0.01]
[--directory=/path/C1-14_Firecrest1.8.20_01-08-2006_user] [--make]
[--GERALD=/path/config.txt] [--control-lane=5]
<run-folder-directory>|<IPAR directory> [<run-folder-directory2>]
Some of the arguments above have example values displayed. The only compulsory argument is the path to the run-folder that is supposed to be analysed, or alternatively any folder containing tiff images that are to be analysed, or just a list of filenames for tiff files.
Example:
Execute the "goat_pipeline.py" script by typing
/path/Pipeline/Goat/goat_pipeline.py /data/070813_ILMN-1_0217_FC1234
This will scan all the image folders in the run-folder "/data/070813_ILMN-1_0217_FC1234" and print diagnostic output about the images and parameter files found and what directories an analysis run would create. No files or directories are modified on the data drive at this stage.
To generate an actual run, use the "--make" option, then change to the newly created analysis directory and start the analysis by typing "make":
/path/Pipeline/Goat/goat_pipeline.py --make /data/070813_ILMN-1_0217_FC1234
The output files containing data statistics and histograms can now be found in the directory "/data/070813_ILMN-1_0217_FC1234/Data/C1-27_Firecrest1.9.0_23-08-2007-user" or similar, where 1.9.0 is the version of Firecrest used, 23 Aug 2007 is the date of the analysis and "user" is your computer login. A new output directory is created each time you rerun the analysis, so there is no need to remove any type of previous analysis.
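The naming scheme of the output directory can be taken apart programmatically. The sketch below assumes the layout seen in the examples on this page (cycle range, Firecrest version, date as DD-MM-YYYY, login name); the helper name and the regular expression are illustrative and not part of the pipeline, and real folder names may differ.

```python
import datetime
import re
from typing import NamedTuple


class AnalysisFolder(NamedTuple):
    first_cycle: int
    last_cycle: int
    firecrest_version: str
    date: datetime.date
    user: str


# Assumed layout: C<first>-<last>_Firecrest<version>_<DD-MM-YYYY>[-_]<user>
_FOLDER_RE = re.compile(
    r"^C(\d+)-(\d+)_Firecrest([\d.]+)_(\d{2})-(\d{2})-(\d{4})[-_](.+)$"
)


def parse_analysis_folder(name):
    """Split an image analysis folder name into its components."""
    match = _FOLDER_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognised analysis folder name: {name!r}")
    first, last, version, day, month, year, user = match.groups()
    return AnalysisFolder(
        first_cycle=int(first),
        last_cycle=int(last),
        firecrest_version=version,
        date=datetime.date(int(year), int(month), int(day)),
        user=user,
    )


if __name__ == "__main__":
    print(parse_analysis_folder("C1-27_Firecrest1.9.0_23-08-2007-user"))
```

The pattern accepts either a hyphen or an underscore before the login name, since both forms appear in the examples.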
Instead of a run-folder one can also specify a full IPAR folder that was produced by an IPAR module attached to the instrument. The advantage is that no further image analysis is needed in this case, and the analysis will run more quickly. In this case, no new folder is created; instead, the additional pipeline output can be found inside the IPAR folder and subfolders thereof.
The Bustard base-caller
The base-caller can be invoked repeatedly without rerunning the image analysis. This may be useful to rerun the base-caller with different parameter settings. The run is set up by a separate script called "bustard.py". Synopsis:
/path/Pipeline/Goat/bustard.py
[--cycles=1-25|auto] [--tiles=s_1,s_2_0003,...]
[--matrix=mymatrix.txt|auto|auto<n>]
[--phasing=0.01|auto|auto<n>] [--prephasing=0.01] [--make]
[--GERALD=/path/config.txt] [--control-lane=<lane>]
<Firecrest directory>,
for example
/path/Pipeline/Goat/bustard.py --control-lane=5
/data/070813_ILMN-1_0217_FC1234/Data/C1-27_Firecrest1.9.0_23-08-2007-user
Again, this will not actually generate any Makefile or directories unless the "--make" option has been specified.
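When the scripts are driven from another program, the command line can be assembled as an argument list. The following is a minimal sketch under that assumption; the function name is hypothetical, and only options documented on this page are used. By default it produces a dry run, matching the behaviour described above.

```python
def pipeline_command(script, directory, make=False,
                     control_lane=None, gerald_configs=()):
    """Build an argument list for goat_pipeline.py or bustard.py.

    Hypothetical helper: it only formats options, it runs nothing.
    """
    cmd = [script]
    if control_lane is not None:
        cmd.append(f"--control-lane={control_lane}")
    # --GERALD may be repeated, once per config file.
    for config in gerald_configs:
        cmd.append(f"--GERALD={config}")
    if make:
        cmd.append("--make")
    cmd.append(directory)
    return cmd


if __name__ == "__main__":
    dry_run = pipeline_command(
        "/path/Pipeline/Goat/bustard.py",
        "/data/070813_ILMN-1_0217_FC1234/Data/"
        "C1-27_Firecrest1.9.0_23-08-2007-user",
        control_lane=5,
    )
    print(" ".join(dry_run))
```

The resulting list can be passed to subprocess.run; adding make=True requests that the Makefile and directories actually be created.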
Full list of command line options
The goat_pipeline.py and bustard.py scripts can be invoked with a number of optional command-line arguments.
General options
- --make: Creates the analysis directory and a "Makefile" in the relevant analysis directory. You can then start the analysis by changing to the directory and typing "make". For more make options see below. If this option is omitted, the pipeline will not write any information to your run-folder.
- --new-read-cycle=<cycle>: Indicates the start of a new read in a paired end run. The calculation of the matrix correction and the application of the phasing correction will be reset for the part of the read after the given cycle.
- --GERALD=<config.txt>: Start the Gerald Makefile generator after the completion of the Bustard run, using the Gerald config file "config.txt". See the documentation GERALD User Guide and FAQ for more details on Gerald and the format of the config file. Multiple Gerald files can be specified by repeating the option with different config filenames.
- --tiles=<tile>|<lane>[,<tile>|<lane>,...]: Select individual tiles for analysis with "--tiles=s_1,s_2_01,s_3_0001,s_5_0002" etc. to select all tiles in channel 1, all tiles starting in "01" in lane 2, position 1 in channel 3 and position 2 in channel 5. Another option is "--tiles=_0010,_0020" to select only tiles 10 and 20 of every lane.
- --cycles=<cycle>[-<cycle>[,<cycle>[-<cycle>...]]]: Select cycles for analysis. Examples: Use "--cycles=1,3-5,7,9-11" to include only cycles 1,3,4,5,7,9,10,11 in the analysis. Note that skipping cycles in the middle of a read usually means that you cannot align the data with Eland anymore. The value "auto" means that the pipeline automatically selects the lowest number of cycles present in any of the tiles (to make sure that all tiles have equal read-lengths, irrespective of the state of data acquisition/mirroring).
- --compression=<method>: Use compression to reduce the size of the Firecrest output. Allowed values are "none", "gzip" (the default) and "bzip2".
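The "--cycles" and "--tiles" selection syntax described above can be sketched in a few lines of Python. This is an illustration of the documented examples only, not the pipeline's parser: it assumes tile names of the form s_<lane>_<4-digit index>, does not handle the "auto" value, and both function names are hypothetical.

```python
def expand_cycles(spec):
    """Expand a cycle specification such as "1,3-5,7" into a list."""
    cycles = []
    for part in spec.split(","):
        if "-" in part:
            first, last = part.split("-")
            cycles.extend(range(int(first), int(last) + 1))
        else:
            cycles.append(int(part))
    return cycles


def tile_selected(tile, selectors):
    """Return True if a tile name matches any of the selectors."""
    for selector in selectors:
        if selector.startswith("_"):
            # "_0010": the given tile index in every lane.
            if tile.endswith(selector):
                return True
        elif selector.count("_") == 1:
            # "s_1": every tile of one lane.
            if tile.startswith(selector + "_"):
                return True
        elif tile.startswith(selector):
            # "s_2_01" or "s_3_0001": prefix of the tile name.
            return True
    return False


if __name__ == "__main__":
    print(expand_cycles("1,3-5,7,9-11"))
    selectors = "s_1,s_2_01,s_3_0001,s_5_0002".split(",")
    for tile in ("s_1_0042", "s_2_0110", "s_2_0210", "s_3_0001"):
        print(tile, tile_selected(tile, selectors))
```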
goat_pipeline.py options
- --nobasecall: Skips the base-calling step in the analysis.
- --offsets=<filename>|auto|default: Images obtained in different colours have small pixel offsets relative to each other. See Pipeline calibration parameters for a more detailed discussion. An offset file can be specified using the option "--offsets=<filename>". If no offset file is specified, the pipeline will try to create one. If the value of this parameter is "auto" (the default behaviour), the Makefile will run a short initial analysis to determine the offsets automatically and then start the proper analysis run. If the value is "default", the pipeline will attempt to identify a suitable default_offsets.txt file and will not do any auto-calibration.
goat_pipeline.py and bustard.py options
- --control-lane=<n>: Use lane <n> to estimate phasing and matrix correction for all other lanes. This option is a synonym for "--phasing=auto<n> --matrix=auto<n>". Control lanes are necessary for samples with skewed base compositions, for example for many tag-based applications.
- --matrix=<filename>|auto|auto<n>|lane: Specify a file that holds the frequency cross-talk matrix to be used, where filename refers to the path of the matrix file. If no matrix is specified, or if the value is "auto" (the default behaviour), the pipeline will auto-generate the matrix. A value of "auto<n>", where <n> is a lane number between 1 and 8, is analogous to the "--phasing=auto<n>" option and allows the matrix estimation to be derived from only one lane. Compare the section on read-pairs for advanced usage for paired end data.
- --phasing=<x>|auto|auto<n>: Apply a phasing correction. If the value of "auto" is given (the default behaviour), the pipeline will auto-generate the phasing and prephasing values. A value of "auto<n>", where <n> is a lane number between 1 and 8, will use the automated phasing estimates from the corresponding lane; this is useful for samples with an uneven base composition (for example in gene expression), for which the current phasing estimator does not work reliably and phasing needs to be estimated from a single control lane. A phasing value can be specified directly, for example "--phasing=0.01" for a correction for phasing with a rate of 1% per cycle (i.e. 1% of molecules in a cluster fall behind the other molecules). In this case, the option is normally combined with the "--prephasing" option. Compare the section on read-pairs for advanced usage for paired end data.
- --prephasing=<x>: Apply a "prephasing" correction; example: "--prephasing=0.01" for a correction for prephasing with a prephasing rate of 1% per cycle. Normally this option should never be used; by default the pipeline autogenerates phasing estimates. There is no "--prephasing=auto"; use "--phasing=auto" instead.
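To get a feeling for what a per-cycle rate of 1% means over a whole read, the arithmetic can be worked through in a short Python sketch. It rests on a simplifying assumption that is not taken from the pipeline: in every cycle a fraction "phasing" of the molecules falls behind and a fraction "prephasing" runs ahead, and molecules that have left the main phase never rejoin it. The function name is hypothetical.

```python
def in_phase_fraction(phasing, prephasing, cycles):
    """Fraction of molecules still in phase after a number of cycles.

    Simplified model (an assumption, not the Bustard estimator): each
    cycle independently removes phasing + prephasing of the molecules
    that were still in phase.
    """
    return (1.0 - phasing - prephasing) ** cycles


if __name__ == "__main__":
    for n in (10, 25, 50):
        fraction = in_phase_fraction(0.01, 0.01, n)
        print(f"after {n} cycles: {fraction:.3f} in phase")
```

Even small rates compound noticeably over the length of a read, which is why the correction is applied by default.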
Unsupported options
- --directory=<Firecrest folder>: Continue a previous image analysis run and add a few more cycles of data. The old analysis run is given by the argument <Firecrest folder>, for example "/data/070813_ILMN-1_0217_FC1234/Data/C1-10_Firecrest1.9.0_23-08-2007-user". This folder will be renamed to a new name based on the total number of cycles and the current date and user. Care needs to be taken that the previous analysis has finished successfully before the new folder is created, otherwise the resulting folder will be corrupted.
Read pairs
See Pipeline usage for more information on read pairs. The following additional variations on the goat_pipeline.py/bustard.py options are supported for read pairs:
- --phasing=<read>:value, --phasing=<read>:<read>: Specify phasing options for one specific read of a pair. Examples:
--phasing=1:auto --phasing=2:auto5
to use the default phasing option for read 1 (in the example, this option is of course redundant), but base the phasing estimates for read 2 on lane 5 only.
--phasing=1:2
to use the phasing estimate for the second read and apply it to both read 1 and read 2.
--phasing=1:2 --phasing=2:auto6
to use the phasing estimate for lane 6 of the second read and apply it to both read 1 and read 2.
- --matrix=<read>:value, --matrix=<read>:<read>: Specify matrix options for one specific read of a pair. This is analogous to the phasing options above.
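The per-read phasing values can be read as a small grammar. The following Python sketch interprets them under stated assumptions: a bare "1" or "2" after the colon is taken as a reference to another read, "auto<n>" selects lane <n>, and anything else is a numeric rate. It covers the phasing examples only (matrix file names are not handled), and the function name and result layout are hypothetical.

```python
import re


def parse_phasing_value(value):
    """Interpret the value of a --phasing option for read pairs."""
    read = None
    if ":" in value:
        read_text, value = value.split(":", 1)
        read = int(read_text)

    # "<read>:<read>": reuse the estimate of another read.
    if read is not None and value in ("1", "2"):
        return {"read": read, "mode": "copy", "source_read": int(value)}

    # "auto" or "auto<n>": automatic estimate, optionally from one lane.
    match = re.fullmatch(r"auto(\d*)", value)
    if match:
        lane = int(match.group(1)) if match.group(1) else None
        return {"read": read, "mode": "auto", "lane": lane}

    # Anything else is treated as an explicit per-cycle rate.
    return {"read": read, "mode": "fixed", "rate": float(value)}


if __name__ == "__main__":
    for example in ("1:auto", "2:auto5", "1:2", "0.01"):
        print(example, parse_phasing_value(example))
```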
Makefiles
Both the goat_pipeline.py and bustard.py scripts generate Makefiles in the relevant image analysis and base-caller directories that allow the complete analysis to be run by GNU Make. The Makefiles have the following advantages:
- Not all of the analysis needs to be run immediately.
- On a multiprocessor system or cluster, the analysis can easily be parallelised by specifying the "-j" option for make.
- In case of any failure or interruption during an analysis run, the run can easily be restarted from the point at which it stopped.
The Makefiles recognise the following targets:
- all: The default target, runs the complete analysis to be done in the current directory (i.e. the image analysis or the base-caller).
- {sample}_{lane}: Analyse all tiles in a lane (e.g. "make s_1 s_2" to make lanes 1 and 2).
- {sample}_{lane}_{tileindex}: Analyse a specific tile only ("make s_1_0007"). This target is currently incompatible with auto-generated matrices and phasing estimates.
- _{tileindex}: Analyse all tiles with the given index from any lane (useful to analyse randomly chosen subsets of tiles; e.g. "make _0020 _0040 _0060" to analyse tiles with indices 20, 40 and 60 in all lanes). This target is currently incompatible with auto-generated matrices and phasing estimates.
- clean: Remove all analysis output files.
- recursive: Do the analysis in the current directory and in all available subdirectories. This can be used to start a complete end-to-end analysis run from image analysis to sequence alignment using a single command. Targets can be specified by setting the TARGET environment variable. Unfortunately the recursive option is not compatible with tile and lane-specific targets. Examples:
- make recursive: Start recursive full analysis.
- make recursive TARGET=clean: Remove all analysis results from ALL subfolders (use with care!)
- compress: Use gzip to compress some of the output files (after an analysis run). This reduces the size of the analysis folders significantly (typically the Firecrest and Bustard folders are reduced to around 1/3 and 1/4 of their original size). In the compressed state, no further analysis can be run on the folder; in order to reanalyse, it has to be uncompressed first.
- uncompress: Uncompress a folder that has previously been compressed and return it to its original state (the compression is lossless).
- compress_images (only available in the Firecrest folder): Use bzip2 to compress the image data. This can take a significant amount of time, but reduces the size of the Images directory to about 60% of its original size. No further image analysis can be run on the compressed folders until "uncompress_images" has been run.
- uncompress_images (only available in the Firecrest folder): Uncompress previously compressed images.
- clean_intermediate: Remove all redundant analysis files in subdirectories.
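The make invocations listed above can also be composed from a script. The following is a minimal Python sketch with a hypothetical helper name; it only builds the argument list, using the targets, the "-j" option and the TARGET variable described in this section.

```python
def make_command(targets=(), jobs=None, variables=None):
    """Build an argument list for a make invocation."""
    cmd = ["make"]
    if jobs is not None:
        # -j <n>: run up to n jobs in parallel.
        cmd += ["-j", str(jobs)]
    cmd += list(targets)
    # Variables such as TARGET are passed as NAME=value arguments.
    for name, value in (variables or {}).items():
        cmd.append(f"{name}={value}")
    return cmd


if __name__ == "__main__":
    print(make_command(["s_1", "s_2"], jobs=4))
    print(make_command(["recursive"], variables={"TARGET": "clean"}))
```

A list built this way can be run with subprocess.run(cmd, cwd=analysis_directory, check=True), where analysis_directory is the folder that contains the generated Makefile.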