This page last changed on Jan 11, 2008 by maising.

This page keeps track of analysis pipeline (and Firecrest/Bustard) releases. Versions are tagged in CVS using the string "Versionx_y_z", where "x.y.z" is the version number.
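
As a small illustration of the tagging convention, the sketch below maps a dotted version number to the corresponding tag string. The function name cvs_tag is hypothetical, and applying the same pattern to version numbers with more than three components (such as 0.2.2.6) is an assumption.

```python
def cvs_tag(version: str) -> str:
    """Return the tag string for a dotted version number, e.g. '1.9.2'."""
    parts = version.split(".")
    # Only plain numeric components are handled; suffixes such as "beta3"
    # are rejected because the page does not say how they are tagged.
    if not all(part.isdigit() for part in parts):
        raise ValueError(f"not a plain dotted version number: {version!r}")
    return "Version" + "_".join(parts)


print(cvs_tag("1.9.2"))
```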

Pipeline:
Latest release is 0.3

Firecrest GUI version:
Latest release is 1.8.23

Firecrest/Bustard pipeline version:
Latest release is 1.9.2

Version 0.3 (Firecrest/Bustard 1.9.2)

  • Bug fix for DOS-style line endings in fastA input files to squashGenome
  • New validation script to check for pipeline prerequisites
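
The line-ending problem can be illustrated with a minimal FASTA reader that tolerates DOS-style (CRLF) line endings and files with several entries. This is an illustrative sketch only, not the squashGenome code; read_fasta is a hypothetical helper.

```python
import io


def read_fasta(handle):
    """Parse FASTA text into {header: sequence}, tolerating CRLF line endings."""
    records = {}
    name = None
    for raw in handle:
        line = raw.rstrip("\r\n")  # drops "\n" and any stray DOS "\r"
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = []
        elif name is None:
            raise ValueError("sequence data before first '>' header")
        else:
            records[name].append(line)
    return {key: "".join(chunks) for key, chunks in records.items()}


# Made-up two-entry input with DOS-style line endings.
dos_text = ">chr1\r\nACGT\r\nTTGA\r\n>chr2\r\nGGCC\r\n"
print(read_fasta(io.StringIO(dos_text)))
```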

Version 0.3beta3 (Firecrest/Bustard 1.9.2)

This release fixes a number of bugs relative to beta2, most notably:

  • A regression which caused the positions on the reverse strand to be incorrectly counted for the eland mode. The new "eland_extended" and "eland_pair" modes were not affected. This bug was not present in 0.2.2.6.
  • A missing dependency in the quality calibration.

Version 0.3beta2 (Firecrest/Bustard 1.9.1)

  • Many bug fixes, including:
    • Fix for various incompatibilities with recent versions of ImageMagick
    • "--cycles" option now works on PE reads
  • Additional statistics for the PE analysis in Summary.htm. See "docs/Summary.htm - more detail.htm"
  • Some of the newly introduced output formats have been revised slightly to make export to external tools easier. See "docs/New analysis modes for GERALD in Pipeline Version 0.3" for a description of the file formats.

Version 0.3beta1 (Firecrest/Bustard 1.9.0)

Major new features:

  • Support for paired end analysis. For PE runs with two run-folders you need to specify both runfolders as arguments to goat_pipeline.py
    • Improved data quality for paired end runs through independent phasing and matrix correction on all reads. The estimation is independently configurable for each read.
    • A new Eland mode (eland_pair) allows the combined alignment of paired end reads and is able to exploit the typical insert length distribution to resolve ambiguities.
  • A new Eland mode (eland_extended) for improved analysis of long non-paired runs (the existing Eland mode is retained as is for backwards compatibility)
  • Optimised image analysis for higher data quality and yield on high-density runs. This has been observed to increase the yield by up to 10% or more on very dense runs.
  • Simplified usage of the Gerald config files. The READ_LENGTH variable is no longer necessary. Wildcards allow automatic expansion of the USE_BASES string - no need to produce long strings of identical characters. The consistency of the input is checked up front.
  • squashGenome can handle multi-entry fastA files.
  • General robustness improvements, for example involving the matrix estimation and the overall robustness of the generated Makefiles. The need to specify default offsets should be eliminated. Summary html files restrict the number of thumbnails plotted to avoid overloading web browsers.
  • Compressed intensity files reduce disk usage of Firecrest output by a factor of up to 3.
    • New options "--compression=none", "--compress=gzip", "--compress=bzip2" to use compressed intensity files. Default is now set to gzip compression; use "--compression=none" if you want the old behaviour. bzip2 compression is slightly more efficient in terms of disk usage, but considerably more costly in terms of CPU usage.
  • Experimental feature to auto-detect cluster sizes and optimise the image analysis accordingly.
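
The trade-off between gzip and bzip2 can be explored with Python's standard gzip and bz2 modules. This sketch only stands in for the idea: how the pipeline itself compresses intensity files is not described here, and the sample data below are made up for illustration.

```python
import bz2
import gzip
import time


def compare_compression(data: bytes) -> dict:
    """Return compressed size and wall-clock time for gzip and bzip2."""
    results = {}
    for label, compress, decompress in (
        ("gzip", gzip.compress, gzip.decompress),
        ("bzip2", bz2.compress, bz2.decompress),
    ):
        start = time.perf_counter()
        packed = compress(data)
        elapsed = time.perf_counter() - start
        # Both codecs are lossless: the round trip must reproduce the input.
        assert decompress(packed) == data
        results[label] = {"bytes": len(packed), "seconds": elapsed}
    return results


# Made-up text that loosely resembles tab-separated intensity values.
sample = "".join(
    f"1\t{i % 100}\t{i * 0.37:.1f}\t{i * 0.11:.1f}\n" for i in range(20000)
).encode()

for label, stats in compare_compression(sample).items():
    print(label, stats["bytes"], "bytes", f"{stats['seconds']:.4f} s")
```

Actual sizes and timings depend on the data and the machine, so the numbers printed here should not be read as pipeline benchmarks.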

Version 0.2.2.6

A bug-fix/robustness improvement release.

  • Fix a problem with running 'ANALYSIS eland' on reads of more than 32 bases: the start positions of alignments to the reverse strand were shifted by a constant offset (equal to the read length minus 32). Alignments to the forward strand are unaffected. This is the most important problem fixed by this release.
  • Make Gerald compatible with make-3.81.
  • Avoid corrupted intensity files for images with certain contamination features (very rare case).
  • Quality recalibration is performed on a per-lane basis instead of a per-tile basis.
  • Significantly reduced memory footprint of the image analysis.
  • A more representative set of tiles is used for the "--offsets=auto" option.
  • The "--tiles" option now accepts a mixture of lane/tile selections, as originally described in the documentation.
  • Prevent goat_pipeline.py from overriding the READ_LENGTH config file setting when clipping the last cycle.
  • Support row/column image filenames.
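
The size of the reverse-strand shift described in the first item follows directly from the read length. A minimal sketch (the function name is hypothetical, and since the direction of the shift is not stated above, only its magnitude is computed):

```python
ELAND_ALIGNED_BASES = 32  # reads longer than this were affected


def reverse_strand_shift(read_length: int) -> int:
    """Size of the erroneous reverse-strand shift fixed in 0.2.2.6."""
    # Reads of 32 bases or fewer were not affected, hence the floor at zero.
    return max(0, read_length - ELAND_ALIGNED_BASES)


for length in (32, 36, 50):
    print(length, reverse_strand_shift(length))
```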

Version 0.2.2.5

  • Fix bug in "ANALYSIS sequence" that caused empty output files
  • Handle corrupted images gracefully in first cycle
  • Handle corrupted qcm.xml files better
  • Handle empty quality calibration files gracefully.
  • Avoid buffer overruns in Gerald "make clean".

Version 0.2.2.4

Various bug fixes:

  • Eland output conversion
  • >32-base Eland support in goat_pipeline.py
  • goat_pipeline.py: "--offsets-auto" option
  • Firecrest: proper formatting of output files.
  • Resolve potential conflict with GZIP and BZIP2 environment variables

Version 0.2.2.2-0.2.2.3

Various bug fixes.

Version 0.2.2 (Firecrest/Bustard 1.8.28)

  • Eland:
    • Read lengths of >32 bases supported. Any read length can now be used for an Eland analysis. (The actual alignment only makes use of the first 32 bases; this also means that in later cycles an arbitrary number of errors is allowed.)
    • ELAND_MULTIPLE_INSTANCES now supports values of 1, 2, 4, 8.
  • Bustard: Base-calling is now faster by a factor of 4-5. This is based on an analytical solution for the 4-dimensional integration of the intensity likelihood. It also gives more discriminatory power for the high-quality bases. For the time being, the maximum score given out by the base-caller is (arbitrarily restricted to) Q40. These scores are uncalibrated, and we will investigate how useful the better dynamic range is in practice.
  • Bustard/Firecrest: Storage requirements for Firecrest/Bustard analysis are reduced by around 50%. This is achieved by removing the intermediate output files and makes the "make clean_intermediate" target obsolete. It also reduces the file IO for Firecrest and Bustard significantly.
  • Goat:
    • new option "--matrix=auto<n>" (analogous to the equivalent phasing option) to use matrix estimates from a single lane.
    • Support for paired reads. Two run-folders can be specified as arguments to goat_pipeline.py. The clusters will be remapped to the first run-folder, the output sequences concatenated and the phasing correction reset appropriately.
    • More robust matrix scaling.
  • Others:
    • New experimental hooks and flags to facilitate custom parallelisation - see Pipeline parallelisation for details.
    • "ANALYSIS sequence" produces intensity graphs and corresponding entries in the Summary page.
    • The "sig.txt" files have been removed.
    • The "qhg.txt" files have moved from the GERALD folder to the Bustard folder.
    • Some more catches for corrupted images.
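
For orientation, a quality score Q on the standard Phred scale corresponds to an error probability of 10^(-Q/10), so a cap of Q40 would correspond to one expected error in 10,000 base calls. The sketch below assumes that standard definition; the base-caller's uncalibrated scores may follow a different convention, and the helper names are hypothetical.

```python
import math

MAX_Q = 40  # cap quoted in the Bustard item above


def phred_to_error_prob(q: float) -> float:
    """Error probability implied by a Phred-scaled quality score."""
    return 10.0 ** (-q / 10.0)


def error_prob_to_phred(p: float, max_q: int = MAX_Q) -> float:
    """Phred-scaled score for an error probability, capped at max_q."""
    if not 0.0 < p <= 1.0:
        raise ValueError("error probability must be in (0, 1]")
    return min(float(max_q), -10.0 * math.log10(p))


print(phred_to_error_prob(40))
print(error_prob_to_phred(1e-6))  # limited by the cap
```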

Version 0.2.1 (Firecrest/Bustard 1.8.27)

  • Bustard base-caller:
    • Integration routines have been rewritten to help compilers with the optimisation. On my development computer this speeds up the base-calling by almost 50%. Your benefits may depend on compiler and architecture.
      full feature in the next release).
  • Firecrest: Handle corrupted image files gracefully (first cycle still not handled)
  • Matrix estimation: Speed optimisations
  • Goat:
    • New version of the "-phasing" option: "-phasing=auto<n>", where <n> is a lane number between 1 and 8, sets the automated phasing estimation to use data from this lane only. This is useful for samples with unbalanced base compositions (such as gene expression tag sequencing), for which the current phasing estimator is known to be unreliable. The phasing can be estimated from a single control lane.
    • Reset phasing calculation for paired reads (experimental).
  • Gerald:
    • sequence.txt files are now filtered.
    • Eland analysis should now appear properly in the summary sheet.
    • The filtering scheme has been modified. The new test for SIMILARITY (similar signal behaviour between adjacent clusters) is designed to remove spurious 'double detections' of the same cluster (present in ~ 0.5% of the data).
    • A new config file option: Define "ELAND_MULTIPLE_INSTANCES 8" in your config file to allow Eland to analyse 8 lanes in parallel. Warning: Note that this requires up to 8 GB of RAM in total, and may crash your analysis if not enough RAM is available (default Eland RAM usage is ~1GB). Do not use this unless you know you've got enough RAM (and if you are using a cluster, be aware that your load-sharing software may not monitor your RAM usage). NB: "ELAND_MULTIPLE_INSTANCES 4" or similar is not supported at the moment; this will still use 8 GB of RAM.

Note: Eland alignment is still limited to 32 bases. We are very much aware of this constraint; unfortunately the fix is non-trivial. We hope to have a work-around available by the end of February. In the meantime, it is possible to script around this limitation by aligning the first 32 bases of a longer read and attaching the missing bases afterwards; no such script is part of the pipeline yet.
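
The scripting approach mentioned in the note could look roughly as follows: split each read into the 32-base prefix that is handed to the aligner and the remaining bases that are re-attached afterwards. This is only a sketch; split_for_eland and reattach are hypothetical helpers, and no Eland input or output file format is assumed.

```python
ELAND_MAX_BASES = 32  # alignment limit stated in the note above


def split_for_eland(read: str, limit: int = ELAND_MAX_BASES):
    """Return (prefix to align, remaining bases to re-attach later)."""
    return read[:limit], read[limit:]


def reattach(aligned_prefix: str, remainder: str) -> str:
    """Rebuild the full-length read after the prefix has been aligned."""
    return aligned_prefix + remainder


read = "ACGT" * 9  # a made-up 36-base read
prefix, rest = split_for_eland(read)
print(len(prefix), len(rest))
assert reattach(prefix, rest) == read
```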

Version 0.2.0.2 (Firecrest/Bustard 1.8.26)

  • Fix SIMILARITY filtering parameter.
  • Eliminate sequences with variable read-lengths from Eland alignment files.
  • Add ".exe" to cygwin executables.

Version 0.2.0 (Firecrest/Bustard 1.8.26)

This release contains several major new features.

  • Base-call calibration: A new set of utilities to improve the base-call calibration has been added. This is essentially a reimplementation of the phred paper. Many thanks go to Pablo Alvarez and Will Brockman at the Broad who pioneered this approach on Solexa data. At the moment the pipeline performs a self-calibration, calculating a phred table on each lane and then applying it to the same lane. This is obviously suboptimal for several reasons and will need to be improved in future releases. It's really just a starting point for future improvements. Please see the new documentation page about base-call calibration for more details.
  • High-throughput analysis. The image analysis has been optimised to accommodate much higher cluster densities than previously. This version of the pipeline has been used to analyse our first 1G runs internally. 20,000 purity filtered clusters per tile are achievable with this pipeline release. Analysing 1G of data brings a new set of challenges:
    • The total amount of raw read data is much more than 1G, as the percentage of purity filtered clusters decreases. This is not a flaw in the system, but a consequence of the Poisson statistics of cluster overlaps. In fact, at the theoretically optimal density only about 37% of the clusters are expected to survive purity filtering.
    • Using phageAlign on a 1G run-folder is a Bad Idea, unless you have excessive amounts of CPU power at your disposal. If you only use a single analysis computer, you could easily spend weeks on the alignments. Eland is the way of the future here - it only takes a few hours.
    • Scaling up by such a large factor from our previous typical run density automatically exposes a new set of bottlenecks in the pipeline. For example, image analysis time does not depend strongly on cluster density and becomes less important in the time budget. We opted for releasing the improvements before eliminating all the new bottlenecks - we expect to address those in the next releases. One major limitation at the moment is the fact that Eland can only align up to 32 bases - this will also be addressed shortly. (There was a time in Solexa's single molecule history when 32 bases seemed like a lot - who would ever want to do more than 32 bases? But then there was a time when 64 kB looked like a lot of RAM.)
    • Optimisations for small dense clusters may not have any effect on large sparse clusters.
  • The phasing estimator performs some purity filtering on its input data; this will hopefully reduce the bias of the estimator towards low phasing values that was encountered for high-density chips.
  • default_offsets files should now be generated automatically if absent.
  • ADVANCE NOTICE of possible future changes to the pipeline:
    • The qalign files may disappear completely. If you have any good reason why you need these, please speak up now.
    • The maximum quality score that the base-caller produces is currently 30. This value is essentially arbitrary and may change (or disappear).
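
The figure of about 37% quoted above for purity filtering can be reproduced with a simple Poisson model. Assume, purely for illustration, that a cluster survives only if no other cluster falls within its footprint, and let lam be the expected number of such overlapping neighbours. The surviving fraction is then exp(-lam), and the yield of usable clusters is proportional to lam * exp(-lam), which peaks at lam = 1, where the surviving fraction is 1/e. This is a textbook idealisation, not the pipeline's actual filter.

```python
import math


def surviving_fraction(lam: float) -> float:
    """Poisson probability that a cluster has no overlapping neighbour."""
    return math.exp(-lam)


def usable_yield(lam: float) -> float:
    """Relative number of surviving clusters at density parameter lam."""
    return lam * surviving_fraction(lam)


# Coarse grid search for the density that maximises the usable yield.
grid = [i / 1000 for i in range(1, 5001)]
best = max(grid, key=usable_yield)
print(best, round(surviving_fraction(best), 3))
```

Beyond the optimum, adding more clusters lowers the number that survive, which is why the raw data volume grows faster than the filtered yield.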

Version 0.1.9.3 (Firecrest/Bustard 1.8.24; unreleased)

Again, some minor bug fixes/improvements only.

  • no editing of Firecrest Makefile for 64-bit architectures required
  • "--matrix=auto" legacy option works properly again
  • Gerald handles relative path names to Gerald executable
  • filtering works on less than 12 cycles
  • "make selftest" fixed for hard-wired genome directory
  • Eland builds under make-3.81
  • Several updates to the documentation

Version 0.1.9.2 (Firecrest/Bustard 1.8.23)

Minor update release - fixes a few small bugs in 0.1.9. The goal is to make 0.1.9 very stable, before we go on to make larger changes for 0.2.0.

  • Fix truncated POST_RUN_COMMAND
  • Allow single-cycle analysis.
  • Make "--deblock" option work again - at least for Firecrest.
  • goat_pipeline.py fixed to recognise the "expression" target.

Version 0.1.9 (Firecrest/Bustard 1.8.22)

  • The pipeline is now compatible with the qmake utility provided by Sun Grid Engine (SGE) 6.0. On the Chesterford cluster, scaling is pretty much linear with the number of nodes. Turn-around time for double-column (200 tiles per lane) experiments on 7 dual-processor, dual-core nodes is around 3-4h/runfolder. Changes:
    • Makefiles now compatible with make 3.78 (on which qmake is based)
    • Removed multi-line make commands; multiple commands for a single target have to be on a single line
  • Improved image remapping. Images are rescaled for an improved cluster coordinate remapping:
    • default_offsets.txt files now contain four columns: x offset, y offset, x scale, y scale. The pipeline will not make use of improved accuracy the very first time the new version is run for a particular instrument, as the default_offsets.txt would only be updated after the run (so it would follow the old behaviour of version 0.1.8). The second run will make use of the new coordinates.
    • This mostly eliminates the problem of erroneous double detections (when cluster positions observed in different colour channels do not line up). It also improves the pass rate of the purity filters for high cluster densities.
  • Usability enhancements:
    • The goat_pipeline.py options "-matrix=auto" and "-phasing=auto" are now defaults and do not need to be specified anymore.
    • Multiple "--GERALD" options can be specified to "goat_pipeline.py" to allow multiple alignments to be generated in different GERALD folders.
    • The pipeline refuses to run when no offsets are specified, and it can autogenerate the Instruments folder. The location of the instrument folder is now configurable via the environment variable "INSTRUMENT_DIR" or the goat_pipeline.py option "-INSTRUMENT_DIR=...". The "-offsets=auto" option now works properly in parallelised mode (however, this option is for initial calibration only, it should not be used as a standard procedure).
    • If Gerald is invoked through the "goat_pipeline.py" script, the parameters READ_LENGTH and USE_BASES are automatically updated to reflect the actual read-length analysed by the image analysis (this should eliminate a set of usability problems where READ_LENGTH and the length of the USE_BASES string did not match).
    • The "goat_pipeline.py" script takes a "--cycles=auto" option, which automatically determines the maximal number of cycles that can be analysed given the current state of the run-folder. This is useful to start off an analysis on an incomplete run-folder (e.g. because data acquisition/mirroring is still ongoing). Together with the previous option, this means that the GERALD config.txt file does not need to be modified.
    • The Scipy/Numpy dependency has been removed. Matrix estimation is now a C++ program and thus much faster.
    • Various bottlenecks encountered during the analysis of large run-folders have been eliminated.
  • Known issues: phageAlign is not suitable for large genome sizes, and never will be. Use Eland instead (Whole genome alignments using Eland).
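
The READ_LENGTH/USE_BASES mismatch that the automatic update is meant to prevent can be expressed as a simple consistency check. The sketch below is hypothetical: it treats USE_BASES as a plain string with one character per cycle and makes no assumption about which characters are valid.

```python
def check_read_config(read_length: int, use_bases: str) -> None:
    """Raise if USE_BASES does not describe exactly READ_LENGTH cycles."""
    if read_length <= 0:
        raise ValueError("READ_LENGTH must be positive")
    if len(use_bases) != read_length:
        raise ValueError(
            f"USE_BASES has {len(use_bases)} characters "
            f"but READ_LENGTH is {read_length}"
        )


# "Y" is only a placeholder character for this illustration.
check_read_config(36, "Y" * 36)  # consistent: no exception
try:
    check_read_config(36, "Y" * 35)
except ValueError as err:
    print("config error:", err)
```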

Pipeline version 0.1.8 (Firecrest/Bustard 1.8.21)

  • Firecrest: Eliminate unnecessary FFT transforms from image alignment. This speeds up the image analysis by 15-20%.
  • Started to eliminate references to the numpy/scipy libraries. scipy is still required in two files in the matrix estimator (Firecrest); this will be removed in the next release. Bustard no longer uses scipy. Due to the rewrite in C++, phasing estimation is now faster by a factor of 100 or so.
  • Gerald: Improved error checking for lane-specific analysis.
  • New make targets "compress" and "uncompress" to reduce the size of the data folder (typically less than 50% of the original size; "compress_images" and "uncompress_images" can be used to reduce the size of the Images folder to about 60% of its original size). The compression is lossless, but no further analysis can be done in the compressed state.
  • Updated the analyse_image utility (this can be used for single-image batch processing). Output should now be identical to the GUI version.
  • Added hook for FocusQuality utility (not currently part of the core pipeline).
  • Various minor bug fixes and updates.

Pipeline version 0.1.7 (Firecrest/Bustard 1.8.20)

  • Fully automated matrix estimation; use the "--matrix=auto" option.
  • Fully automated phasing estimation; use the "--phasing=auto" option.
  • Various improvements to output messages, more error checking on parameter input.
  • Fixed (extremely rare) floating point exception in linear algebra module
  • Matrix estimation fully parallelisable to 8 lanes
  • Removed "lsmake" incompatible make syntax
  • Save offset statistics
  • Minimal support for row/column image filenames.

The automated phasing and matrix estimation are currently incompatible with the make targets for individual tiles.

Firecrest/Bustard release notes v1.8.19

  • Fix buffer overrun caused by imprecise floating point arithmetic in Firecrest.
  • Improved handling of out-of-focus images when phasing-correction is active.
  • Phasing estimation is now part of the pipeline.
  • Slimmed down Makefile.
  • "--norun" option has been removed - it is now the default.

Firecrest/Bustard release notes v1.8.18

  • Fix bug in Bustard: for a phasing-corrected run some non-detections could be counted as errors
  • Fix performance bottleneck for unusual image sizes.
  • Fix crash of matrix estimator on poor data.

Firecrest/Bustard release notes v1.8.17

  • Add matrix estimation to the default pipeline. A new subdirectory called "Matrix" is created which contains the matrices estimated for the first two sequencing cycles. Currently the matrices do not get used in the further analysis.
  • Fix issue for image sizes that are not multiples of 8.
  • Add new option to run GERALD from the goat_pipeline.py script: "--GERALD=config.txt", where config.txt is the name of the Gerald config file.
  • Rename "Phoenix" to "Firecrest". Also rename various source code modules and variables to remove references to the old name. Move into new cvs module.

Firecrest/Bustard release notes v1.8.16

  • cygwin version gets linked to optimised FFT libraries again
  • "nodeblock" option to goat_pipeline.py removed - "nodeblocked" is now the default option. If you want to include deblock images, specify "--deblock".
  • Slightly more descriptive output for corrupted image files.
  • "nobasecall" option works again.
  • A number of additional make targets have been introduced to facilitate make-based analysis. See also the pipeline documentation in the wiki.
    • "make recursive": Start analysis in current directory and all subdirectories. This way it is possible to do both Phoenix and Bustard from one command. (And the GERALD makefile gets executed too if there is one; this feature is untested). See the full documentation to see how to combine the "recursive" option with more complex targets.
    • "make clean1" - remove all files produced by analysis of lane 1. Similar for other lanes.
    • "make s_1_0003" - short-cut for "make s_1_0003_int.txt" or "make s_1_0003_seq.txt".
    • "make _0003" - analyse all images taken at the third tile index for any lane. This is convenient if you want to run some quick preliminary analysis on a subset of tiles from each lane.
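
The target names above follow a regular pattern that can be generated programmatically, for example when scripting a quick preliminary analysis. In this sketch the helper names are hypothetical, and the zero-padding of tile numbers to four digits is inferred from the examples above.

```python
def tile_target(lane: int, tile: int) -> str:
    """Make target for one tile of one lane, e.g. 's_1_0003'."""
    return f"s_{lane}_{tile:04d}"


def tile_index_target(tile: int) -> str:
    """Make target for the same tile index in every lane, e.g. '_0003'."""
    return f"_{tile:04d}"


def clean_target(lane: int) -> str:
    """Make target that removes all files produced for one lane."""
    return f"clean{lane}"


print(tile_target(1, 3), tile_index_target(3), clean_target(1))
```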

Firecrest/Bustard release notes v1.8.14

  • Collect all image analysis output functions in one code section.
  • Hold full image analysis data objects until threading is complete.
  • Remove unneeded temporary arrays.
  • Move base-calling functionality to separate (deprecated) method.

This version should make it easier to customise the output functions and may also reduce the risk of IO related errors.
A drawback is that the memory footprint is increased (still less than 150 MB).

Firecrest/Bustard release notes v1.8.13

  • Fix division by zero bug in full cluster statistics mode
  • make version of pipeline now updates offsets
  • make version should be more robust with respect to restart: no intermediate targets anymore
  • qc files automatically overwritten on restart

Firecrest/Bustard release notes v1.8.12

  • Bustard accepts clusters without tiles
  • Remove obsolete output (hdr.txt files)
  • Clean up main program to make merging GCPipe easier
  • Formatting and white-space changes

Firecrest/Bustard release notes v1.8.11

  • Prephasing correction now possible in the base-caller.
  • Removed old "--efficiency" switch from the pipeline. New switches:
    • "--phasing=x", where x is the phasing rate (usually around 0.01)
    • "--prephasing=x", where x is the prephasing rate (usually around 0.01)
    • There is currently no mechanism to estimate phasing rates automatically.
  • Decoupled image analysis and base-calling completely.
    • This is to prepare for the introduction of autogenerated matrices and should also reduce the number of unnecessary base-caller runs.
    • New pipeline switch: "--nobasecall": do not run the base-caller at all.
    • The base-caller can always be run at a later stage using the bustard.py script.
  • The pipeline now autogenerates makefiles.
    • This should be useful for parallelising the analysis using the "-j" make option, or by just running some lanes/tiles at a time. Might also be useful in case of failure.
    • makefiles are now generated on every invocation of the pipeline, but not normally executed.
    • New pipeline switch: "--make": do not run any image analysis/base-calling at all, but generate all the folders, makefiles and prerequisites required for a subsequent "make" command.
    • Image analysis and base-calling need to be "made" separately (and for obvious reasons the image analysis needs to be run first).
    • Useful make targets:
      • "all": analyse all lanes and tiles
      • "s_1": analyse all tiles in lane 1
  • New command line option:
    • "--threshold=x": set the cluster detection threshold. Default is 12. It is not recommended to actually use this option.
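
To give a feel for the magnitude of the phasing and prephasing rates mentioned above, the sketch below uses a simplified model that is an assumption here, not the base-caller's algorithm: in every cycle a fraction "phasing" of the molecules in a cluster falls one base behind and a fraction "prephasing" runs one base ahead, and molecules that drift back into phase are ignored.

```python
def in_phase_fraction(
    cycles: int, phasing: float = 0.01, prephasing: float = 0.01
) -> float:
    """Approximate fraction of molecules still in phase after `cycles` cycles."""
    if cycles < 0:
        raise ValueError("cycles must be non-negative")
    stay = 1.0 - phasing - prephasing  # fraction that stays in step per cycle
    if not 0.0 <= stay <= 1.0:
        raise ValueError("phasing + prephasing must lie between 0 and 1")
    return stay ** cycles


for n in (1, 18, 36):
    print(n, round(in_phase_fraction(n), 3))
```

Even small per-cycle rates compound over a run, which is why the correction matters more for longer reads.
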
Document generated by Confluence on Jan 11, 2008 15:41