This page last changed on Dec 11, 2006 by maising.

There is a new software module in the pipeline: Pipeline/QCalibration.

This is essentially a reimplementation of the phred paper (Ewing and Green, Base-Calling of Automated Sequencer Traces Using Phred. II. Error Probabilities, 1998, Genome Research, 8, 186). The calibration framework uses a set of trace- and base-specific parameters that are indicative of base-call quality to predict a new, calibrated quality score. The mapping between old and new scores is encoded in a ("phred") table and is calibrated on a set of known alignments. At the moment the pipeline performs the calibration as a post-processing step after base-calling, using each lane both as its own training set and as the target data. Despite this auto-calibration (i.e. training set and target data are the same) in the current pipeline version, the underlying programs could easily be used for cross-calibration.

The main goal is to bring the quality scores (and corresponding predicted error rates) that the Solexa base-caller ("Bustard") generates into line with the error rates obtained from an alignment. This obviously rests on the assumption that the alignments are more or less correct; violations of that assumption will skew the calibration and limit its accuracy. For example, if the sample in question contains a contaminant (e.g. E. coli sequence) at the 5% level, the corresponding reads may - depending on read length and target genome - be mistaken for aligned reads and contribute significantly to the error rates. A reference obtained from a different individual would limit the maximum quality to the rate at which SNPs are observed - the quality scores cannot get better than the SNP rate. The maximum achievable score is also limited by the size of the data set: in order to obtain Q40, one would need around 10^4 base-pairs at a quality of Q40 (presumably even more because of the Poisson counting error). For more information on alignment-related issues, see Alignment Scoring Guide and FAQ.
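The data-set-size limit above follows directly from the standard phred formula, Q = -10 log10(p): a minimal sketch, using only that textbook definition.

```python
import math

def phred_quality(error_rate):
    """Standard phred formula: Q = -10 * log10(p)."""
    return -10.0 * math.log10(error_rate)

# Q40 corresponds to an error rate of 1e-4, so one needs on the
# order of 10^4 aligned bases just to observe a single error at
# that level - hence the data-set-size limit described above.
q40_rate = 1e-4
print(phred_quality(q40_rate))

# Likewise, a reference from a different individual with a SNP
# rate of about 1e-3 caps the achievable quality near Q30.
snp_rate = 1e-3
print(phred_quality(snp_rate))
```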

The original Bustard base-caller scores are essentially based on an error propagation of the estimated noise on the underlying raw cluster intensities; the methodology is described in more detail elsewhere. We have opted to generate 4 quality scores rather than just one, giving the relative probabilities that the underlying base is an A, C, G or T. In order to accommodate the 4 different scores and obtain a sensible dynamic range, we have modified the phred formula slightly to generate a new set of quality scores that approach the phred definition asymptotically for Q > 10-15 (Alignment Scoring Guide and FAQ). These scores have been shown to

  1. approximate the alignment quality scores reasonably (in many cases) for Q < 10,
  2. be monotonic with respect to error rates in most cases,
  3. provide additional value even for the non-called base.
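The asymptotic behaviour described above can be illustrated with the widely documented odds-ratio variant of the phred formula, Q = -10 log10(p / (1 - p)); whether this is the exact formula used by the base-caller is an assumption here, but it shows how such a modification converges to the phred definition for small error probabilities.

```python
import math

def phred_q(p):
    # Standard phred: Q = -10 * log10(p)
    return -10.0 * math.log10(p)

def odds_q(p):
    # Odds-ratio variant (assumed here for illustration; the
    # pipeline's exact modification may differ in detail):
    # Q = -10 * log10(p / (1 - p))
    return -10.0 * math.log10(p / (1.0 - p))

# For small p the two definitions converge (Q > 10-15),
# while for large p they diverge noticeably.
for p in (0.25, 0.1, 0.01, 0.001):
    print(p, round(phred_q(p), 2), round(odds_q(p), 2))
```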

Unfortunately, the calibration may be far from perfect, particularly for high quality scores Q >> 10. At the moment their dynamic range is also limited to Q=30; this is of course an arbitrary cut-off (motivated by 1. the rising CPU time it would take in the current implementation to go beyond Q30 and 2. the fact that the quality estimation procedure used by the base-caller is unlikely to be accurate beyond Q30 anyway). The aim of the recalibration is to improve the correlation of the scores with the alignment error rates and ideally also to extend their range beyond Q30.

Usage

The calibration is part of the Solexa pipeline from version 0.2.0 onwards and will be run automatically as part of the genomic analysis.

If you want to run the software directly, there are three utilities that you need to calibrate existing quality scores.

  1. extractQualityParams.py
  2. QualityCalibration
  3. QualityApply

Extraction of parameters

The first step is a simple Python script to extract the relevant parameters (and training alignments) from the run-folder. At the moment, this script is used for self-calibration and requires both the parameters and the alignments to be present. Of course, at some stage it may be desirable to use the parameters only and apply a table derived from a different data set.

The parameters that are currently in use are inspired by work done by Pablo Alvarez and Will Brockman:

  1. cycle number (currently used in the realign files)
  2. purity scores (currently used for filtering, here on a per-base basis)
  3. purity scores (currently used for filtering, over PURE_BASES bases)
  4. old Solexa quality score (currently used in qalign files)

The usage of the quality extractor is:

Pipeline/QCalibration/extractQualityParams.py s_5_0001_align.txt [--USE_BASES=nYYYYYYYYYY] \
  [--PURE_BASES=12] [--sig-suffix=sig2] [--dir=<bustardfolder>] s_5_0001_qval.txt 

Instead of align files, you could also use prealign or qalign files (see What do the different files in an analysis directory mean). The alignment file has to be in a proper run-folder as the script will look for prb and sig files in a directory above the current one unless you specify the "--dir" option. The USE_BASES and PURE_BASES options need to follow the same format used for the alignments (see GERALD User Guide and FAQ).

The qval file has the following format: each row stands for one base and contains one tab-separated column per quality parameter. The last two columns contain the called base and the corresponding base from the alignment; if no alignment is available, a '.' is used.
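Based on the column layout just described, a row can be parsed as follows; this is a minimal sketch, and any column ordering beyond what is stated above is an assumption.

```python
def parse_qval_line(line):
    """Parse one row of a qval file: tab-separated quality
    parameters, followed by the called base and the aligned base
    ('.' when no alignment is available).  This follows the
    format description above; the helper itself is hypothetical,
    not part of the pipeline.
    """
    fields = line.rstrip("\n").split("\t")
    *params, called, aligned = fields
    return ([float(x) for x in params],
            called,
            aligned if aligned != "." else None)

# Example row with 4 parameters, a called 'A' and an aligned 'A':
params, called, aligned = parse_qval_line("3\t0.91\t0.88\t25\tA\tA")
```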

Creation of a quality table

Pipeline/QCalibration/QualityCalibration s_5_0001_qval.txt > s_5_0001_qtable.txt

You can specify an arbitrary number of qval files, and a qval file may contain more than 4 parameters. The calibration routines are agnostic as to the number of parameters, so it would be easy to use a different set of parameters with a different extractor script.

NB: The last sentence is not quite true yet; the only place the number of parameters is hard-coded is the file QualityCalibration.cpp, where it is needed to identify a suitable binning of the parameters. This could easily be changed to

  1. auto-generate the binning, or
  2. read the binning from a config file
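The core of the table-building step can be sketched from the phred paper's counting scheme: group training bases by a parameter binning and convert each bin's observed error rate into a quality score. The pseudocount form (1 + errors)/(1 + total) is the "1+..." term mentioned under Observations below; the binning function and data layout here are illustrative assumptions, not the actual QualityCalibration implementation.

```python
import math
from collections import defaultdict

def calibrate(rows, bin_fn):
    """Sketch of phred-style table building (assumed, simplified).

    rows   - iterable of (params, called_base, aligned_base),
             with aligned_base None for unaligned bases
    bin_fn - hypothetical function mapping a parameter vector
             to a hashable bin key
    Returns a dict mapping bin key -> calibrated quality score.
    """
    counts = defaultdict(lambda: [0, 0])  # bin -> [errors, total]
    for params, called, aligned in rows:
        if aligned is None:
            continue  # unaligned bases carry no training signal
        key = bin_fn(params)
        counts[key][1] += 1
        if called != aligned:
            counts[key][0] += 1
    # Pseudocount caps the best achievable Q by the data-set size:
    # with zero errors in N bases, Q = -10 * log10(1 / (N + 1)).
    return {k: -10.0 * math.log10((e + 1.0) / (t + 1.0))
            for k, (e, t) in counts.items()}
```

For example, a bin holding 9999 error-free training bases comes out at Q40, which is exactly the data-set-size limit discussed in the introduction.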

Generating the new scores

Pipeline/QCalibration/QualityApply s_5_0001_qtable.txt s_5_0001_qval.txt > s_5_0001_qprb.txt
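Conceptually, the apply step is a table lookup: each base's parameters are binned and the bin's calibrated score is emitted. A minimal sketch, assuming the same row/bin conventions as above; the fallback for bins missing from the table is an assumption, and the real QualityApply tool may handle them differently.

```python
def apply_table(table, rows, bin_fn, default_q=0.0):
    """Sketch of the apply step (assumed, simplified).

    table     - dict mapping bin key -> calibrated quality score
    rows      - iterable of (params, called_base, aligned_base)
    bin_fn    - hypothetical parameter-binning function
    default_q - assumed fallback for bins absent from the table
    """
    return [table.get(bin_fn(params), default_q)
            for params, called, aligned in rows]
```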

Observations

  • The best achievable Q score depends on the total number of data points (because of the 1+... term in the version of the phred formula implemented in the paper).
  • The quality of the alignments in the training data set is also crucial.
  • The parameters used may not be strictly monotonic predictors of data quality; there are at least some cases in which subsets of the data have higher quality scores than the earlier rules would suggest.

TODO

  • Read bin sizes from config file and/or auto-generate even binning.
  • Use better rules for finding maxima than suggested in the phred paper - for example, the binning is currently restricted to rounded values of the phred scores, and the number of bins in different dimensions may not be the same - so the "maximum sum of indices" criterion is not very useful (it is presumably not very useful in the original paper either, unless one can really make sure that all bins have similar occupancy numbers).
  • Sort tables.
  • Allow files in a different format when applying the calibration.

Acknowledgments

Many thanks go to Pablo Alvarez and Will Brockman, who pioneered the application of the phred scheme to Solexa data and suggested the parameters we are currently using.

Document generated by Confluence on Mar 09, 2007 16:11