This page last changed on Mar 09, 2007 by maising.

This document describes the basic usage of the "GOAT pipeline", a part of the data analysis pipeline that is a wrapper for the Firecrest image analysis software and the Bustard base-caller.

Although several different software programs are involved in an analysis run, a single command can be used to start the whole pipeline automatically, from the images saved by the instruments up to base-called sequences and probability estimates.

The pipeline operates in a specific directory structure in which the images and analysis output files are saved. This run-folder specification is described in detail in the document Run-Folder Specification. Output files produced by the analysis software are saved in a hierarchical directory structure. Each run of one of the main analysis modules creates a subdirectory in the run-folder. For example, each run of the image analysis software will create a new image analysis output folder. Each run of the base-calling software will create a new subdirectory in the image analysis subdirectory that the base calls are based on, resulting in a tree-like structure of analyses. Parameters and versions for any given analysis run are logged in the folder structure in order to make it possible to reconstruct any previous analysis run.

The modular structure should make it easy to replace specific modules of the pipeline with alternative or more recent implementations. The whole analysis is centred around dependencies between input and output files (for example, in order to produce a sequence file, an intensity file with the output of the image analysis is required) and is based on the Unix "make" utility, which is designed to model dependency trees by specifying dependency rules for files. The actual implementation uses make extensions that may only be available in the GNU version of make. The use of make also makes parallelisation easy because the parallel forking of processes described by make rules is one of the features of many make implementations.
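
The dependency idea behind make can be illustrated with a minimal Python sketch. It is not part of the pipeline; the function name and the file names in the usage note are hypothetical. It only shows the general rule that make applies: a target is rebuilt when it is missing or older than one of its prerequisites.

```python
import os

def needs_rebuild(target, prerequisites):
    """Return True if `target` is missing or older than any prerequisite.

    Illustrative only: this mirrors the timestamp rule used by make,
    it is not code taken from the pipeline.
    """
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    # Any prerequisite modified after the target makes the target stale.
    return any(os.path.getmtime(p) > target_mtime for p in prerequisites)
```

For example, needs_rebuild("sequence.txt", ["intensity.txt"]) would report whether a (hypothetical) sequence file has to be regenerated from its intensity file.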

The installation of the pipeline is described in Pipeline installation.

An overview of the pipeline usage is given in Pipeline usage.

Running the pipeline

Running the pipeline requires some basic knowledge of the DOS or Unix operating systems. As mentioned above, the standard pipeline invocation assumes that the images are organised in a certain directory structure described by the run-folder specification. (The pipeline can also deal with certain other configurations, such as manual sequencing output produced from the Manteia instruments. These configurations are not actively supported.) Unlike the stand-alone Firecrest image analysis package, the rest of the sequencing pipeline does not have a graphical user interface (GUI). The general synopsis of a pipeline call is

/path/Pipeline/Goat/goat_pipeline.py [--cycles=1-25|auto] [--tiles=s_1,s_2_0003,...]
   [--matrix=mymatrix.txt|auto|auto<n>] [--offsets=/path/default_offsets.txt|auto]
   [--phasing=0.01|auto|auto<n>] [--prephasing=0.01]
   [--directory=/path/C1-14_Firecrest1.8.20_01-08-2006_user] [--run] [--make]
   [--GERALD=/path/config.txt] run-folder-directory [run-folder-directory2 ]

Some of the arguments above have example values displayed. The only compulsory argument is the path to the run-folder that is supposed to be analysed, or alternatively any folder containing tiff images that are to be analysed, or just a list of filenames for tiff files.

Example:

Execute the "goat_pipeline.py" script by typing

/home/user/Pipeline/Goat/goat_pipeline.py /data/00002887

This will scan all the image folders in the run-folder "/data/00002887" and print diagnostic output about the images and parameter files found and about the directories an analysis run would create. No files or directories are modified on the data drive at this stage.

To generate an actual run, use the "--make" option and start the analysis by typing "make":

/home/user/Pipeline/Goat/goat_pipeline.py --make /data/00002887

The output files containing data statistics and histograms can now be found in the directory "/data/00002887/Data/C1-27_Firecrest1.8.20_23-06-2006-user" or similar, where 1.8.20 is the version of Firecrest used, 23 June 2006 is the date of the analysis and "user" is your computer login. A new output directory is created each time you rerun the analysis, so there is no need to remove any previous analysis.
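
The naming scheme of the analysis directory can be taken apart with a short Python sketch. The pattern below is an assumption inferred from the example names in this document, not taken from the pipeline source. In particular, the leading "C1-27" is assumed to be the cycle range, and either "_" or "-" is accepted before the login name. The function name parse_analysis_dir is hypothetical.

```python
import re
from datetime import date

# Assumed layout: C<first>-<last>_<Software><version>_<dd-mm-yyyy>_<user>
# (inferred from the examples in this document, not from the pipeline code).
_ANALYSIS_DIR = re.compile(
    r"^C(?P<first>\d+)-(?P<last>\d+)_"
    r"(?P<software>[A-Za-z]+)(?P<version>\d+(?:\.\d+)*)_"
    r"(?P<day>\d{2})-(?P<month>\d{2})-(?P<year>\d{4})"
    r"[_-](?P<user>.+)$"
)

def parse_analysis_dir(name):
    """Split an analysis directory name into its assumed components."""
    m = _ANALYSIS_DIR.match(name)
    if m is None:
        raise ValueError("unrecognised analysis directory name: %r" % name)
    return {
        "cycles": (int(m.group("first")), int(m.group("last"))),
        "software": m.group("software"),
        "version": m.group("version"),
        "date": date(int(m.group("year")), int(m.group("month")), int(m.group("day"))),
        "user": m.group("user"),
    }
```

Such a helper could be used, for instance, to find the most recent analysis of a run-folder by sorting the parsed dates.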

Read pairs

The Goat part of the pipeline supports read pairs. See Pipeline usage for more information.

Command line options

The goat_pipeline.py script can be invoked with a number of optional command-line arguments:

  • --nobasecall: Skips the base-calling step in the analysis.
  • --tiles=<tile>|<lane>[,<tile>|<lane>,...]: Select individual tiles or whole lanes for analysis. For example, "--tiles=s_1,s_3_0001,s_5_0002" selects all tiles in channel 1, position 1 in channel 3 and position 2 in channel 5.
  • --cycles=<cycle>[-<cycle>[,<cycle>[-<cycle>...]]]: Select cycles for analysis. Examples: Use "--cycles=1,3-5,7,9-11" to include only cycles 1,3,4,5,7,9,10,11 in the analysis. Use "goat_pipeline.py --tiles=s_1 --cycles=1-8 ." to analyse all images from channel 1 up to cycle 8. The value "auto" means that the pipeline automatically selects the lowest number of cycles present in any of the tiles (to make sure that all tiles have equal read-lengths, irrespective of the state of data acquisition/mirroring).
  • --matrix=<filename>: Specify the file that holds the frequency cross-talk matrix to be used, where <filename> is the path of the matrix file. If no matrix is specified, or if the value is "auto", the pipeline will auto-generate the matrix. Alternatively, the value "default" selects the matrix location assumed by Phoenix, /path/Instruments/InstrumentName/matrix.txt, where "path" is the directory in which the run-folder lives (e.g. "/data" in the example "/data/00002887") and "InstrumentName" is the hostname of the sequencing instrument the experiment was performed on (compare the "--offsets" option). A value of "auto<n>", where <n> is a lane number between 1 and 8, is analogous to the "--phasing=auto<n>" option and derives the matrix estimate from that lane only. (However, we would be interested to hear about any systematic issues with the matrix estimation that prompt you to use this option.)
  • --offsets=<filename>|auto: Images obtained in different colours have small pixel offsets relative to each other. The remapping routines automatically correct for these in later cycles, but in the first cycle, when no previous images can be used as a reference, the offsets have to be corrected manually. This can be done by providing a file called "default_offsets.txt" in the "Data" directory, or in the default matrix directory described in the previous item. This file contains the offsets to be applied to the images in the first cycle, in 4 rows (one per colour), each holding the x and y pixel offsets and the x and y scale factors. Alternatively, a different offset file can be specified using the option "--offsets=filename". If no offset file is specified, the pipeline will try to create such a file. If the value of this parameter is "auto", the Makefile will run a short initial analysis to determine the offsets automatically and then start the proper analysis run. This is recommended only for instrument recalibration, as default offsets calculated from a subset of the tiles are not expected to be more accurate than offsets calculated from all tiles of a previous run.
  • --phasing=<x>: Apply a phasing correction; example: "--phasing=0.01" for a correction for phasing with a rate of 1% per cycle (i.e. 1% of molecules in a cluster fall behind the other molecules). If no value or the value "auto" is given, the pipeline will auto-generate the phasing and prephasing values; no prephasing value needs to be given in this case. A value of "auto<n>", where <n> is a lane number between 1 and 8, will use the automated phasing estimates from the corresponding lane; this is useful for samples with an uneven base composition (for example in gene expression), for which the current phasing estimator does not work reliably and phasing needs to be estimated from a single control lane.
  • --prephasing=<x>: Apply a "prephasing" correction; example: "--prephasing=0.01" for a correction for prephasing with a prephasing rate of 1% per cycle. There is no "--prephasing=auto"; use "--phasing=auto" instead. By default the pipeline auto-generates the phasing and prephasing estimates.
  • --reset-phasing-cycle=<cycle no>: Restart the phasing correction in the given cycle. This option can be used more than once per run. The option is useful for an analysis of runs combined from different sequencing primers, for example in the case of paired reads.
  • --make: Creates the analysis directory and a "Makefile" in the relevant analysis directory. You can then start the analysis by changing to the directory and typing "make". For more make options see below. If this option is omitted, the pipeline will not write any information to your run-folder.
  • --GERALD=<config.txt>: Start the Gerald Makefile generator after the completion of the run, using the Gerald config file "config.txt". This is best combined with the "--make" option. See the documentation GERALD User Guide and FAQ for more details on Gerald and the format of the config file. Multiple Gerald files can be specified by repeating the option with different config filenames.
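
The value syntax of the "--cycles" and "--tiles" options described above can be illustrated with a short Python sketch. These helpers are hypothetical and only mirror the behaviour as documented here; they are not the parsing code used by goat_pipeline.py. The dictionary of per-tile cycle counts passed to auto_cycles is likewise an assumed data structure.

```python
def expand_cycles(spec):
    """Expand a spec such as "1,3-5,7,9-11" into a sorted list of cycles."""
    cycles = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            start, end = (int(x) for x in part.split("-", 1))
            if start > end:
                raise ValueError("descending range: %r" % part)
            cycles.update(range(start, end + 1))
        else:
            cycles.add(int(part))
    return sorted(cycles)

def auto_cycles(cycles_per_tile):
    """Mimic the documented "auto" value: the lowest cycle count of any tile."""
    return min(cycles_per_tile.values())

def split_tile_selector(selector):
    """Split "s_1" or "s_3_0001" into (sample, lane, tile index or None)."""
    parts = selector.split("_")
    if len(parts) == 2:
        return parts[0], int(parts[1]), None
    if len(parts) == 3:
        return parts[0], int(parts[1]), int(parts[2])
    raise ValueError("unrecognised tile selector: %r" % selector)
```

With these definitions, expand_cycles("1,3-5,7,9-11") yields the cycles 1, 3, 4, 5, 7, 9, 10 and 11 listed in the option description.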

Obsolete/unsupported options

  • --directory=/data/00002559/Data/C2-26_Phoenix1.8.8_11-10-2005_user: Continue a previous image analysis run and add a few more cycles of data. The old analysis run is given by the argument "/data/00002559/Data/C2-26_Phoenix1.8.8_11-10-2005_user". This folder will be renamed to a new name based on the total number of cycles and the current date and user. If the instrument run has not finished and more data may be added during the pipeline run, it is recommended that you use this option together with the "--cycles" option to prevent the pipeline from trying to analyse half-finished cycles. The first cycle to specify for this option is the first cycle of the whole run, not just of the newly added data. Note: This option is incompatible with the "make" framework, and it is not actively supported at the moment - it may not even work anymore.
  • --sequence-directory=/data/00002559/Data/C2-26_Phoenix1.8.8_11-10-2005_user/Bustard1.8.8_08-11-2005_user: Almost identical to the previous option and can be used to specify the directory where the sequence output of the previous analysis run is kept (the previous command simply uses the latest run of the base-caller as a default, otherwise the effect of the options is identical). Note: This option is incompatible with the "make" framework, and it is not actively supported at the moment - it may not even work anymore.
  • --run: Starts an analysis run directly from within the script, without reference to the Makefile. This configuration is no longer fully maintained, as it is not parallelisable.

The Bustard base-caller

The base-caller can be invoked repeatedly without rerunning the image analysis. This may be useful to rerun the base-caller with different parameter settings. The run is set up by a separate script called "bustard.py". Synopsis:

Pipeline/Goat/bustard.py [--matrix=<filename>] [--run] [--make] [--phasing=<fraction>|auto] 
  [--prephasing=<fraction2>] [--tiles=<tilelist>] [--GERALD=<config file>] <analysis-directory>,

for example "bustard.py --matrix=matrix.txt --phasing=0.01 --prephasing=0.01 --tiles=s_1,s_3 /data/00002559/Data/C2-26_Firecrest1.8.20_01-08-2006_user". Again, this will not actually generate any Makefile or directories unless the "--make" option has been specified.
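
When bustard.py is rerun with several parameter settings, it can help to assemble the command line programmatically. The following Python sketch builds an argument list from the options in the synopsis above. The function bustard_command and its keyword names are hypothetical, and the default script path is only the relative path shown in the synopsis.

```python
def bustard_command(analysis_dir, matrix=None, phasing=None, prephasing=None,
                    tiles=None, gerald=None, make=False,
                    script="Pipeline/Goat/bustard.py"):
    """Build an argument list for bustard.py (illustrative helper only)."""
    argv = [script]
    if matrix is not None:
        argv.append("--matrix=%s" % matrix)
    if phasing is not None:
        argv.append("--phasing=%s" % phasing)
    if prephasing is not None:
        argv.append("--prephasing=%s" % prephasing)
    if tiles:
        argv.append("--tiles=" + ",".join(tiles))
    if gerald is not None:
        argv.append("--GERALD=%s" % gerald)
    if make:
        # Without --make, nothing is written to the run-folder.
        argv.append("--make")
    argv.append(analysis_dir)
    return argv
```

The resulting list can be passed to subprocess.run(argv, check=True). Because the arguments are passed as a list, no shell quoting is needed.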

Makefiles

Both the goat_pipeline.py and bustard.py scripts generate Makefiles in the relevant image analysis and base-caller directories that allow the complete analysis to be run by GNU Make. The Makefiles have the following advantages:

  • Not all of the analysis needs to be run immediately.
  • On a multiprocessor system or cluster, the analysis can easily be parallelised by specifying the "-j" option for make.
  • In case of any failure or interruption during an analysis run, the run can easily be restarted at the last point.

The Makefiles recognise the following targets:

  • all: The default target, runs the complete analysis to be done in the current directory (i.e. the image analysis or the base-caller).
  • {sample}_{lane}: Analyse all tiles in a lane (e.g. "make s_1 s_2" to make lanes 1 and 2).
  • {sample}_{lane}_{tileindex}: Analyse a specific tile only ("make s_1_0007"). This target is currently incompatible with auto-generated matrices and phasing estimates.
  • _{tileindex}: Analyse all tiles with the given index from any lane (useful to analyse a randomly chosen subset of tiles; e.g. "make _0020 _0040 _0060" to analyse tiles with indices 20, 40 and 60 in all lanes). This target is currently incompatible with auto-generated matrices and phasing estimates.
  • clean: Remove all analysis output files.
  • recursive: Do the analysis in the current directory and in all available subdirectories. This can be used to start a complete end-to-end analysis run from image analysis to sequence alignment using a single command. Targets can be specified by setting the TARGET environment variable. Examples:
    • make recursive: Start recursive full analysis.
    • make recursive TARGET=clean: Remove all analysis results from ALL subfolders (use with care!)
    • make recursive TARGET="s_1 s_2": Complete analysis for lanes 1 and 2 (note that multiple targets have to be enclosed in quotation marks).
  • compress: Use gzip to compress some of the output files (after an analysis run). This reduces the size of the analysis folders significantly (typically the Firecrest and Bustard folders are reduced to around 1/3 and 1/4 of their original size). In the compressed state, no further analysis can be run on the folder; in order to reanalyse, it has to be uncompressed first.
  • uncompress: Uncompress a folder that has previously been compressed and return it to its original state (the compression is lossless).
  • compress_images (only available in the Firecrest folder): Use bzip2 to compress the image data. This can take a significant amount of time, but reduces the size of the Images directory to about 60% of its original size. No further image analysis can be run on the compressed folders until "uncompress_images" has been run.
  • uncompress_images (only available in the Firecrest folder): Uncompress previously compressed images.
  • clean_intermediate: Remove all redundant analysis files in subdirectories.
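
The targets above can also be combined into make invocations from a script. The following Python sketch assembles such command lines. The helpers make_command and tile_target are hypothetical; the target names, the "-j" option and the "TARGET=" syntax follow the descriptions in this section.

```python
def tile_target(lane, tile_index, sample="s"):
    """Build a tile target such as "s_1_0007" (4-digit, zero-padded index)."""
    return "%s_%d_%04d" % (sample, lane, tile_index)

def make_command(targets=(), jobs=None, recursive=False):
    """Build an argument list for make (illustrative helper only)."""
    argv = ["make"]
    if jobs is not None:
        # Parallel execution, as offered by GNU make's -j option.
        argv += ["-j", str(jobs)]
    if recursive:
        argv.append("recursive")
        if targets:
            # Multiple targets are passed as one space-separated value.
            argv.append("TARGET=" + " ".join(targets))
    else:
        argv.extend(targets)
    return argv
```

For example, make_command(["s_1", "s_2"], recursive=True) corresponds to the command make recursive TARGET="s_1 s_2". When the list is passed to subprocess.run, the quotation marks are not needed because no shell splits the arguments.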

Notes

A few restrictions on the image files are imposed by the software:

  • You can have an arbitrary number of enzymology and deblocking cycles in your experiment, but for each position on the chip you need to have the same number of colours in each cycle. This is only relevant for instrument testing where fewer than 4 colours may be acquired.
  • The first cycle has to be an enzymology cycle, not a deblocking cycle.
  • No two deblocking cycles may follow each other.
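
The two ordering restrictions above can be checked with a few lines of Python before an analysis is started. This is a minimal sketch: the function name and the representation of a run as a list of the strings "enzymology" and "deblocking" are assumptions made for illustration, not a format used by the pipeline.

```python
def check_cycle_order(kinds):
    """Return a list of violations of the cycle ordering restrictions.

    `kinds` is a hypothetical list such as
    ["enzymology", "deblocking", "enzymology"], one entry per cycle.
    """
    problems = []
    if kinds and kinds[0] != "enzymology":
        problems.append("first cycle must be an enzymology cycle")
    for i in range(1, len(kinds)):
        if kinds[i] == "deblocking" and kinds[i - 1] == "deblocking":
            # Report 1-based cycle numbers.
            problems.append("cycles %d and %d are both deblocking cycles" % (i, i + 1))
    return problems
```

An empty result means that the sequence of cycles satisfies both restrictions.
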
Document generated by Confluence on Apr 05, 2007 11:25