This page last changed on Aug 17, 2007 by maising.

This page keeps track of analysis pipeline (and Firecrest/Bustard) releases. Versions are tagged in cvs using the string "Versionx_y_z", where "x.y.z" is the version number.
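
The tagging convention above can be sketched in a few lines of Python. This is only an illustration: the helper name cvs_tag is hypothetical and not part of the pipeline, and it covers purely numeric versions only (how a label such as "0.3beta1" is tagged is not stated on this page).

```python
def cvs_tag(version: str) -> str:
    """Return the cvs tag for a dotted numeric version such as '0.2.2.5'.

    Hypothetical helper: it only mirrors the "Versionx_y_z" convention
    described on this page.
    """
    parts = version.strip().split(".")
    if not all(part.isdigit() for part in parts):
        raise ValueError(f"not a dotted numeric version: {version!r}")
    return "Version" + "_".join(parts)


print(cvs_tag("0.2.2.5"))  # Version0_2_2_5
```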

Pipeline:
Latest release is 0.2.2.5

Pipeline development version

Firecrest GUI version:
Latest release is 1.8.23

Firecrest/Bustard production pipeline version:
Latest release is 1.8.28

Firecrest/Bustard development version:
Latest version is 1.9.0

Version 0.3beta1 (Firecrest 1.9.0)

Firecrest

  • high-density high-throughput optimisation
    • improved high-density analysis
  • Cluster size estimator
  • Make --offsets=auto the default behaviour; this should eliminate the offset-related ways to mess up a pipeline run
  • IPAR interface implementation
  • Full read-pair support
  • disk storage reduced by a factor of 3 by use of compressed intensity files

Bustard

  • Read pair support
    • Reset matrices for second read of a read-pair
    • Phasing estimate for second read of a pair
      • individual automatic phasing estimation for both reads
  • new more robust matrix estimator

Gerald

  • Eland: multiple-entry fast-A files
  • read-pair support
    • multiple alignment Eland
    • new file formats
    • easier specification of USE_BASES, READ_LENGTH
  • Clean up graphs/summary reports
    • clean up signal files
    • web pages:
      • jerboa without alignments
      • new summary stats
      • paired end stats
    • remove obsolete graphs
    • reduce the number of thumbnails per html page to avoid browser crashes

Robustness:

  • Report errors in pipes, redirections and multiple commands
  • more consistent error messages

Various fixes:

  • handle obsolete .params XML files
  • Handle relative paths for GENOME_FILE consistently

Version 0.2.2.6

A bug-fix/robustness improvement release.

  • Fix a problem with running 'ANALYSIS eland' on reads of more than 32 bases: the start positions of alignments to the reverse strand were shifted by a constant offset (equal to the read length minus 32). Alignments to the forward strand are unaffected. This is the most important problem fixed by this release.
  • Make Gerald compatible with make-3.81.
  • Avoid corrupted intensity files for images with certain contamination features (very rare case).
  • Quality recalibration is performed on a per-lane basis instead of a per-tile basis.
  • Significantly reduced memory footprint of the image analysis.
  • A more representative set of tiles is used for the "--offsets=auto" option.
  • The "--tiles" option now accepts a mixture of lane/tile selections, as originally described in the documentation.
  • Prevent goat_pipeline.py from overriding the READ_LENGTH config file setting when clipping the last cycle.
  • Support row/column image filenames.
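
One way to see where an offset of "read length minus 32" can come from in the reverse-strand fix above: if only the first 32 bases of a read are aligned, then for a reverse-strand hit those 32 bases sit at the right-hand end of the read's footprint on the forward reference. The sketch below assumes positions are leftmost forward-strand coordinates; that convention and the function name are assumptions made for illustration, not a description of Eland's actual output.

```python
SEED_LENGTH = 32  # number of leading bases used for the alignment


def full_read_start(seed_start: int, read_length: int, reverse: bool) -> int:
    """Leftmost reference coordinate of the whole read.

    seed_start is the leftmost coordinate of the aligned 32-base seed
    (assumed convention, see the text above).
    """
    extra = max(read_length - SEED_LENGTH, 0)  # bases beyond the seed
    if reverse:
        # On the reverse strand the extra bases extend to the left.
        return seed_start - extra
    # On the forward strand the extra bases extend to the right.
    return seed_start


print(full_read_start(1000, 36, reverse=False))  # 1000
print(full_read_start(1000, 36, reverse=True))   # 996
```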

Version 0.2.2.5

  • Fix bug in "ANALYSIS sequence" that caused empty output files
  • Handle corrupted images gracefully in first cycle
  • Handle corrupted qcm.xml files better
  • Handle empty quality calibration files gracefully.
  • Avoid buffer overruns in Gerald "make clean".

Version 0.2.2.4

Various bug fixes:

  • Eland output conversion
  • >32-base Eland support in goat_pipeline.py
  • goat_pipeline.py: "--offsets=auto" option
  • Firecrest: proper formatting of output files.
  • Resolve potential conflict with GZIP and BZIP2 environment variables

Version 0.2.2.2-0.2.2.3

Various bug fixes.

Version 0.2.2 (Firecrest/Bustard 1.8.28)

  • Eland:
    • Read lengths of >32 bases supported. Any read length can now be used for an Eland analysis. (The actual alignment only makes use of the first 32 bases; this also means that in later cycles an arbitrary number of errors is allowed.)
    • ELAND_MULTIPLE_INSTANCES now supports values of 1, 2, 4, 8.
  • Bustard: Base-calling is now faster by a factor of 4-5. This is based on an analytical solution for the 4-dimensional integration of the intensity likelihood. It also gives more discriminatory power for the high-quality bases. For the time being, the maximum score given out by the base-caller is (arbitrarily restricted to) Q40. These scores are uncalibrated, and we will investigate how useful the better dynamic range is in practice.
  • Bustard/Firecrest: Storage requirements for Firecrest/Bustard analysis are reduced by around 50%. This is achieved by removing the intermediate output files and makes the "make clean_intermediate" target obsolete. It also reduces the file IO for Firecrest and Bustard significantly.
  • Goat:
    • new option "--matrix=auto<n>" (analogous to the equivalent phasing option) to use matrix estimates from a single lane.
    • Support for paired reads. Two run-folders can be specified as arguments to goat_pipeline.py. The clusters will be remapped to the first run-folder, the output sequences concatenated and the phasing correction reset appropriately.
    • More robust matrix scaling.
  • Others:
    • New experimental hooks and flags to facilitate custom parallelisation - see Pipeline parallelisation for details.
    • "ANALYSIS sequence" produces intensity graphs and corresponding entries in the Summary page.
    • The "sig.txt" files have been removed.
    • The "qhg.txt" files have moved from the GERALD folder to the Bustard folder.
    • Some more catches for corrupted images.
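
For orientation on the Q40 cap mentioned above: on the standard phred scale a score Q corresponds to an error probability of 10^(-Q/10), so Q40 would mean one expected error in 10,000 calls. The sketch below only shows that conversion; since the scores in this release are described as uncalibrated, it should not be read as a statement about their actual error rates. The function names are hypothetical.

```python
import math


def phred_to_error_probability(q: float) -> float:
    """Error probability implied by a phred-scaled quality score."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p: float) -> float:
    """Phred-scaled quality score for an error probability 0 < p <= 1."""
    if not 0 < p <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(p)


for q in (10, 20, 30, 40):
    print(f"Q{q}: {phred_to_error_probability(q):.4%} expected error rate")
```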

Version 0.2.1 (Firecrest/Bustard 1.8.27)

  • Bustard base-caller:
    • Integration routines have been rewritten to help compilers with the optimisation. On my development computer this speeds up the base-calling by almost 50%. Your benefits may depend on compiler and architecture.
      full feature in the next release).
  • Firecrest: Handle corrupted image files gracefully (first cycle still not handled)
  • Matrix estimation: Speed optimisations
  • Goat:
    • New version of the "--phasing" option: "--phasing=auto<n>", where <n> is a lane number between 1 and 8, sets the automated phasing estimation to use data from this lane only. This is useful for samples with unbalanced base compositions (such as gene expression tag sequencing), for which the current phasing estimator is known to be unreliable. The phasing can be estimated from a single control lane.
    • Reset phasing calculation for paired reads (experimental).
  • Gerald:
    • sequence.txt files are now filtered.
    • Eland analysis should now appear properly in the summary sheet.
    • The filtering scheme has been modified. The new test for SIMILARITY (similar signal behaviour between adjacent clusters) is designed to remove spurious 'double detections' of the same cluster (present in ~ 0.5% of the data).
    • A new config file option: Define "ELAND_MULTIPLE_INSTANCES 8" in your config file to allow Eland to analyse 8 lanes in parallel. Warning: Note that this requires up to 8 GB of RAM in total, and may crash your analysis if not enough RAM is available (default Eland RAM usage is ~1GB). Do not use this unless you know you've got enough RAM (and if you are using a cluster, be aware that your load-sharing software may not monitor your RAM usage). NB: "ELAND_MULTIPLE_INSTANCES 4" or similar is not supported at the moment; this will still use 8 GB of RAM.
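
The values the phasing option can take ("auto", "auto<n>" with a lane between 1 and 8, or a numeric rate) can be told apart as in the sketch below. This is a hypothetical parser written for illustration; it is not the code in goat_pipeline.py, and it assumes that a plain numeric rate is still accepted alongside the "auto" forms.

```python
import re

# "auto" optionally followed by a single lane digit from 1 to 8
_AUTO = re.compile(r"auto([1-8])?")


def parse_phasing_option(value: str):
    """Return ('auto', lane_or_None) or ('fixed', rate).

    Hypothetical helper, not part of the pipeline.
    """
    match = _AUTO.fullmatch(value)
    if match:
        lane = int(match.group(1)) if match.group(1) else None
        return ("auto", lane)
    if value.startswith("auto"):
        raise ValueError("lane must be a single digit from 1 to 8")
    return ("fixed", float(value))


print(parse_phasing_option("auto"))   # ('auto', None)
print(parse_phasing_option("auto3"))  # ('auto', 3)
print(parse_phasing_option("0.01"))   # ('fixed', 0.01)
```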

Note: Eland alignment is still limited to 32 bases. We are very much aware of this constraint; unfortunately the fix is non-trivial. We hope to have a work-around available by the end of February. As a work-around, it is possible to script around this limitation by aligning the first 32 bases of a longer read and attaching the missing bases; no such script is part of the pipeline yet.
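
The splitting and re-attaching step of such a script might look roughly like the sketch below. It covers only the string handling; the file formats involved and the call to Eland itself are left out, and the function names are hypothetical.

```python
ELAND_MAX_BASES = 32  # alignment limit stated in the note above


def split_read(read: str, limit: int = ELAND_MAX_BASES):
    """Split a read into the part to align and the bases left over."""
    return read[:limit], read[limit:]


def reattach(aligned_part: str, remainder: str) -> str:
    """Rebuild the full read once the first part has been aligned."""
    return aligned_part + remainder


read = "ACGT" * 9  # a 36-base example read
head, tail = split_read(read)
print(len(head), len(tail))          # 32 4
print(reattach(head, tail) == read)  # True
```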

Version 0.2.0.2 (Firecrest/Bustard 1.8.26)

  • Fix SIMILARITY filtering parameter.
  • Eliminate sequences with variable read-lengths from Eland alignment files.
  • Add ".exe" to cygwin executables.

Version 0.2.0 (Firecrest/Bustard 1.8.26)

This release contains several major new features.

  • Base-call calibration: A new set of utilities to improve the base-call calibration has been added. This is essentially a reimplementation of the phred paper. Many thanks go to Pablo Alvarez and Will Brockman at the Broad who pioneered this approach on Solexa data. At the moment the pipeline performs a self-calibration, calculating a phred table on each lane and then applying it to the same lane. This is obviously suboptimal for several reasons and will need to be improved in future releases. It's really just a starting point for future improvements. Please see the new documentation page about base-call calibration for more details.
  • High-throughput analysis. The image analysis has been optimised to accommodate much higher cluster densities than previously. This version of the pipeline has been used to analyse our first 1G runs internally. 20,000 purity filtered clusters per tile are achievable with this pipeline release. Analysing 1G of data brings a new set of challenges:
    • The total amount of raw read data is much more than 1G, as the percentage of purity filtered clusters decreases. This is not a flaw in the system, but a consequence of the Poisson statistics of cluster overlaps. In fact, at the theoretically optimal density only about 37% of the clusters are expected to survive purity filtering.
    • Using phageAlign on a 1G run-folder is a Bad Idea, unless you have excessive amounts of CPU power at your disposal. If you only use a single analysis computer, you could easily spend weeks on the alignments. Eland is the way of the future here - it only takes a few hours.
    • Scaling up by such a large factor from our previous typical run density automatically exposes a new set of bottlenecks in the pipeline. For example, image analysis time does not depend strongly on cluster density and becomes less important in the time budget. We opted for releasing the improvements before eliminating all the new bottlenecks - we expect to address those in the next releases. One major limitation at the moment is the fact that Eland can only align up to 32 bases - this will also be addressed shortly. (There was a time in Solexa's single molecule history when 32 bases seemed like a lot - who would ever want to do more than 32 bases? But then there was a time when 64 kB looked like a lot of RAM.)
    • Optimisations for small dense clusters may not have any effect on large sparse clusters.
  • The phasing estimator performs some purity filtering on its input data; this will hopefully reduce the bias of the estimator towards low phasing values that was encountered for high-density chips.
  • default_offsets files should now be generated automatically if absent.
  • ADVANCE NOTICE of possible future changes to the pipeline:
    • The qalign files may disappear completely. If you have any good reason why you need these, please speak up now.
    • The maximum quality score that the base-caller produces is currently 30. This value is essentially arbitrary and may change (or disappear).
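
The figure of about 37% quoted above for the optimal density can be reproduced with a simple Poisson argument. The sketch assumes a simplified model that is not spelled out on this page: a cluster survives purity filtering only if no other cluster overlaps it, and the number of overlapping neighbours is Poisson distributed with mean equal to the density. The surviving fraction is then exp(-density), and the yield of surviving clusters, density * exp(-density), peaks at a density of 1.

```python
import math


def surviving_fraction(density: float) -> float:
    """Probability that a cluster has no overlapping neighbour
    (Poisson probability of zero events with the given mean)."""
    return math.exp(-density)


def surviving_yield(density: float) -> float:
    """Surviving clusters per unit area, in units of the density scale."""
    return density * surviving_fraction(density)


# Scan densities from 0.01 to 3.00 and pick the one with the best yield.
densities = [i / 100 for i in range(1, 301)]
best = max(densities, key=surviving_yield)
print(best, round(surviving_fraction(best), 3))  # 1.0 0.368
```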

Version 0.1.9.3 (Firecrest/Bustard 1.8.24; unreleased)

Again, some minor bug fixes/improvements only.

  • no editing of Firecrest Makefile for 64-bit architectures required
  • "--matrix=auto" legacy option works properly again
  • Gerald handles relative path names to Gerald executable
  • filtering works on fewer than 12 cycles
  • "make selftest" fixed for hard-wired genome directory
  • Eland builds under make-3.81
  • Several updates to the documentation

Version 0.1.9.2 (Firecrest/Bustard 1.8.23)

Minor update release - fixes a few small bugs in 0.1.9. The goal is to make 0.1.9 very stable, before we go on to make larger changes for 0.2.0.

  • Fix truncated POST_RUN_COMMAND
  • Allow single-cycle analysis.
  • Make "--deblock" option work again - at least for Firecrest.
  • goat_pipeline.py fixed to recognise the "expression" target.

Version 0.1.9 (Firecrest/Bustard 1.8.22)

  • The pipeline is now compatible with the qmake utility provided by Sun Grid Engine (SGE) 6.0. On the Chesterford cluster, scaling is pretty much linear with the number of nodes. Turn-around time for double-column (200 tiles per lane) experiments on 7 dual-processor, dual-core nodes is around 3-4h/runfolder. Changes:
    • Makefiles now compatible with make 3.78 (on which qmake is based)
    • Removed multi-line make commands; multiple commands for a single target have to be on a single line
  • Improved image remapping. Images are rescaled for an improved cluster coordinate remapping:
    • default_offsets.txt files now contain four columns: x offset, y offset, x scale, y scale. The pipeline will not make use of the improved accuracy the very first time the new version is run for a particular instrument, as default_offsets.txt is only updated after the run (so the first run follows the old behaviour of version 0.1.8). The second run will make use of the new coordinates.
    • This mostly eliminates the problem of erroneous double detections (when cluster positions observed in different colour channels do not line up). It also improves the pass rate of the purity filters for high cluster densities.
  • Usability enhancements:
    • The goat_pipeline.py options "--matrix=auto" and "--phasing=auto" are now defaults and do not need to be specified anymore.
    • Multiple "--GERALD" options can be specified to "goat_pipeline.py" to allow multiple alignments to be generated in different GERALD folders.
    • The pipeline refuses to run when no offsets are specified, and it can autogenerate the Instruments folder. The location of the instrument folder is now configurable via the environment variable "INSTRUMENT_DIR" or the goat_pipeline.py option "--INSTRUMENT_DIR=...". The "--offsets=auto" option now works properly in parallelised mode (however, this option is for initial calibration only and should not be used as a standard procedure).
    • If Gerald is invoked through the "goat_pipeline.py" script, the parameters READ_LENGTH and USE_BASES are automatically updated to reflect the actual read-length analysed by the image analysis (this should eliminate a set of usability problems where READ_LENGTH and the length of the USE_BASES string did not match).
    • The "goat_pipeline.py" script takes a "--cycles=auto" option, which automatically determines the maximal number of cycles that can be analysed given the current state of the run-folder. This is useful to start off an analysis on an incomplete run-folder (e.g. because data acquisition/mirroring is still ongoing). Together with the previous option, this means that the GERALD config.txt file does not need to be modified.
    • The Scipy/Numpy dependency has been removed. Matrix estimation is now a C++ program and thus much faster.
    • Various bottlenecks encountered during the analysis of large run-folders have been eliminated.
  • Known issues: phageAlign is not suitable for large genome sizes, and never will be. Use Eland instead (Whole genome alignments using Eland).
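
The READ_LENGTH/USE_BASES mismatch mentioned above is easy to check for by hand when Gerald is not invoked through goat_pipeline.py. The sketch below assumes the config file holds one "KEY value" pair per line, as in the "ELAND_MULTIPLE_INSTANCES 8" example elsewhere on this page; the function names and the "YYYY" value are placeholders for illustration, not documented syntax.

```python
def parse_config(text: str) -> dict:
    """Read 'KEY value' lines into a dictionary (assumed file layout)."""
    settings = {}
    for line in text.splitlines():
        fields = line.split(None, 1)  # split on the first run of whitespace
        if len(fields) == 2:
            settings[fields[0]] = fields[1].strip()
    return settings


def check_read_settings(settings: dict) -> None:
    """Raise if the USE_BASES string is not READ_LENGTH characters long."""
    read_length = int(settings["READ_LENGTH"])
    use_bases = settings["USE_BASES"]
    if len(use_bases) != read_length:
        raise ValueError(
            f"READ_LENGTH is {read_length} but USE_BASES has "
            f"{len(use_bases)} characters"
        )


config = parse_config("READ_LENGTH 4\nUSE_BASES YYYY\n")
check_read_settings(config)  # passes silently: both describe 4 bases
```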

Pipeline version 0.1.8 (Firecrest/Bustard 1.8.21)

  • Firecrest: Eliminate unnecessary FFT transforms from image alignment. This speeds up the image analysis by 15-20%.
  • Started to eliminate references to the numpy/scipy libraries. scipy is still required in two files in the matrix estimator (Firecrest); this will be removed in the next release. Bustard no longer uses scipy. Due to the rewrite in C++, phasing estimation is now faster by a factor of 100 or so.
  • Gerald: Improved error checking for lane-specific analysis.
  • New make targets "compress" and "uncompress" to reduce the size of the data folder (typically less than 50% of the original size; "compress_images" and "uncompress_images" can be used to reduce the size of the Images folder to about 60% of its original size). The compression is lossless, but no further analysis can be done in the compressed state.
  • Updated the analyse_image utility (this can be used for single-image batch processing). Output should now be identical to the GUI version.
  • Added hook for FocusQuality utility (not currently part of the core pipeline).
  • Various minor bug fixes and updates.

Pipeline version 0.1.7 (Firecrest/Bustard 1.8.20)

  • Fully automated matrix estimation; use the "--matrix=auto" option.
  • Fully automated phasing estimation; use the "--phasing=auto" option.
  • Various improvements to output messages, more error checking on parameter input.
  • Fixed (extremely rare) floating point exception in linear algebra module
  • Matrix estimation fully parallelisable to 8 lanes
  • Removed "lsmake" incompatible make syntax
  • Save offset statistics
  • Minimal support for row/column image filenames.

The automated phasing and matrix estimation is currently incompatible with the make targets for individual tiles.

Firecrest/Bustard release notes v1.8.19

  • Fix buffer overrun caused by imprecise floating point arithmetic in Firecrest.
  • Improved handling of out-of-focus images when phasing-correction is active.
  • Phasing estimation is now part of the pipeline.
  • Slimmed down Makefile.
  • "--norun" option has been removed - it is now the default.

Firecrest/Bustard release notes v1.8.18

  • Fix bug in Bustard: for a phasing-corrected run some non-detections could be counted as errors
  • Fix performance bottleneck for unusual image sizes.
  • Fix crash of matrix estimator on poor data.

Firecrest/Bustard release notes v1.8.17

  • Add matrix estimation to the default pipeline. A new subdirectory called "Matrix" is created which contains the matrices estimated for the first two sequencing cycles. Currently the matrices do not get used in the further analysis.
  • Fix issue for image sizes that are not multiples of 8.
  • Add new option to run GERALD from the goat_pipeline.py script: "--GERALD=config.txt", where config.txt is the name of the Gerald config file.
  • Rename "Phoenix" to "Firecrest". Also rename various source code modules and variables to remove references to the old name. Move into new cvs module.

Firecrest/Bustard release notes v1.8.16

  • cygwin version gets linked to optimised FFT libraries again
  • "nodeblock" option to goat_pipeline.py removed - "nodeblocked" is now the default option. If you want to include deblock images, specify "--deblock".
  • Slightly more descriptive output for corrupted image files.
  • "nobasecall" option works again.
  • A number of additional make targets has been introduced to facilitate the make analysis. See also the pipeline documentation in the wiki.
    • "make recursive": Start analysis in current directory and all subdirectories. This way it is possible to do both Phoenix and Bustard from one command. (And the GERALD makefile gets executed too if there is one; this feature is untested). See the full documentation to see how to combine the "recursive" option with more complex targets.
    • "make clean1" - remove all files produced by analysis of lane 1. Similar for other lanes.
    • "make s_1_0003" - short-cut for "make s_1_0003_int.txt" or "make s_1_0003_seq.txt".
    • "make _0003" - analyse all images taken at the third tile index for any lane. This is convenient if you want to run some quick preliminary analysis on a subset of tiles from each lane.
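
The target names listed above follow a regular pattern, which makes them easy to generate from a script. The helpers below are a hypothetical convenience, not part of the pipeline; the four-digit zero padding of the tile index is inferred from the "s_1_0003" and "_0003" examples.

```python
def tile_target(lane: int, tile: int) -> str:
    """Target for one tile of one lane, e.g. 's_1_0003'."""
    return f"s_{lane}_{tile:04d}"


def tile_index_target(tile: int) -> str:
    """Target for the same tile index in every lane, e.g. '_0003'."""
    return f"_{tile:04d}"


def clean_target(lane: int) -> str:
    """Target that removes all analysis output of one lane, e.g. 'clean1'."""
    return f"clean{lane}"


print(tile_target(1, 3))     # s_1_0003
print(tile_index_target(3))  # _0003
print(clean_target(1))       # clean1
```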

Firecrest/Bustard release notes v1.8.14

  • Collect all image analysis output functions in one code section.
  • Hold full image analysis data objects until threading is complete.
  • Remove unneeded temporary arrays.
  • Move base-calling functionality to separate (deprecated) method.

This version should make it easier to customise the output functions and may also reduce the risk of IO related errors.
A drawback is that the memory footprint is increased (still less than 150 MB).

Firecrest/Bustard release notes v1.8.13

  • Fix division by zero bug in full cluster statistics mode
  • make version of pipeline now updates offsets
  • make version should be more robust with respect to restart: no intermediate targets anymore
  • qc files automatically overwritten on restart

Firecrest/Bustard release notes v1.8.12

  • Bustard accepts clusters without tiles
  • Remove obsolete output (hdr.txt files)
  • Clean up main program to make merging GCPipe easier
  • Formatting and white-space changes

Firecrest/Bustard release notes v1.8.11

  • Prephasing correction now possible in the base-caller.
  • Removed old "--efficiency" switch from the pipeline. New switches:
    • "--phasing=x", where x is the phasing rate (usually around 0.01)
    • "--prephasing=x", where x is the prephasing rate (usually around 0.01)
    • There is currently no mechanism to estimate phasing rates automatically.
  • Decoupled image analysis and base-calling completely.
    • This is to prepare for the introduction of autogenerated matrices and should also reduce the number of unnecessary base-caller runs.
    • New pipeline switch: "--nobasecall": do not run base caller at all
    • The base-caller can always be run at a later stage using the bustard.py script.
  • The pipeline now autogenerates makefiles.
    • This should be useful for parallelising the analysis using the "-j" make option, or by just running some lanes/tiles at a time. Might also be useful in case of failure.
    • makefiles are now generated on every invocation of the pipeline, but not normally executed.
    • New pipeline switch: "--make": do not run any image analysis/base-calling at all, but generate all the folders, makefiles and prerequisites required for a subsequent "make" command.
    • Image analysis and base-calling need to be "made" separately (and for obvious reasons the image analysis needs to be run first).
    • Useful make targets:
      • "all": analyse all lanes and tiles
      • "s_1": analyse all tiles in lane 1
  • New command line option:
    • "--threshold=x": set the cluster detection threshold. Default is 12. It is not recommended to actually use this option.
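
To give a feel for what phasing and prephasing rates of around 0.01 mean over a run, the sketch below uses a deliberately simplified model: in every cycle a fraction equal to the phasing rate falls one base behind and a fraction equal to the prephasing rate runs one base ahead, so the share of molecules still in phase after n cycles is roughly (1 - phasing - prephasing)^n. This model is an assumption made for illustration and is not a description of the correction Bustard applies.

```python
def in_phase_fraction(cycles: int, phasing: float = 0.01,
                      prephasing: float = 0.01) -> float:
    """Approximate fraction of molecules still in phase after `cycles`
    cycles under the simplified model described above."""
    if phasing < 0 or prephasing < 0 or phasing + prephasing > 1:
        raise ValueError("rates must be non-negative and sum to at most 1")
    return (1.0 - phasing - prephasing) ** cycles


for n in (1, 12, 25, 36):
    print(n, round(in_phase_fraction(n), 3))
```
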
Document generated by Confluence on Dec 19, 2007 18:32