Data analysis - documentation : Base-call calibration
This page last changed on Jul 31, 2007 by rshaw.
There is a new software module in the pipeline: Pipeline/QCalibration. This is essentially a reimplementation of the method in the phred paper (Ewing and Green, Base-Calling of Automated Sequencer Traces Using Phred. II. Error Probabilities, 1998, Genome Research, 8, 186). The calibration framework uses a set of trace- and base-specific parameters that are indicative of the base-call quality to predict a new calibrated quality score. The mapping between old and new scores is encoded in a ("phred") table. The mapping is calibrated on a set of known alignments.

At the moment the pipeline performs the calibration procedure as a post-processing step to the base-calling, and it uses each lane both as a separate training set and as the target data. Although the current pipeline version auto-calibrates in this way (i.e. training set and target data are the same), the underlying programs could easily be used for cross-calibration.

The main goal is to bring the quality scores (and corresponding predicted error rates) that the Solexa base-caller ("Bustard") generates into line with the error rates obtained from an alignment. This rests on the assumption that the alignments are more or less correct; violations of that assumption will skew the calibration and limit its accuracy. For example, if the sample in question contains a contaminant (e.g. E. coli sequence) at the 5% level, the corresponding reads may - depending on read-length and target genome - be mistaken for aligned reads and contribute significantly to the error rates. A reference obtained from a different individual would limit the maximum quality to the rate at which SNPs are observed - the quality scores cannot get better than the SNP rate. The maximum attainable quality score is also limited by the size of the data set. For example, in order to obtain Q40 (an error rate of 1 in 10^4), one would need around 10^4 base-pairs at a quality of Q40 (presumably even more because of the Poisson counting error).
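The arithmetic behind these limits can be sketched in a few lines of Python. This is only an illustration, assuming the standard phred definition Q = -10 log10(p) from the cited paper; the function names are hypothetical and not part of Pipeline/QCalibration, and the add-one smoothing of the counts is an assumption made here to keep an error-free bin from receiving an infinite score, not necessarily what the pipeline does.

```python
import math


def phred_to_error(q):
    """Error probability implied by a phred-style quality score."""
    return 10.0 ** (-q / 10.0)


def error_to_phred(p):
    """Phred-style quality score for an error probability 0 < p <= 1."""
    return -10.0 * math.log10(p)


def empirical_quality(mismatches, aligned_bases):
    """Quality estimated from alignment counts.

    The add-one smoothing is an assumption of this sketch: it caps the
    score that a bin without any observed mismatch can be given.
    """
    return error_to_phred((mismatches + 1.0) / (aligned_bases + 1.0))


# Best score an error-free bin can reach, as a function of its size.
for n_bases in (10**2, 10**3, 10**4, 10**5):
    print(n_bases, round(empirical_quality(0, n_bases), 1))
```

Under these assumptions a bin needs on the order of 10^4 error-free bases before it can be assigned about Q40. In the same way, a reference that differs from the sample at, say, 1 position in 1000 (an illustrative figure) would cap the observed quality near error_to_phred(1e-3), i.e. Q30.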
For more information on alignment-related issues, see Alignment Scoring Guide and FAQ.

The original Bustard base-caller scores are essentially based on an error propagation of the estimated noise on the underlying raw cluster intensities; the methodology is described in more detail elsewhere. We have opted to generate 4 quality scores rather than just one, giving the relative probabilities that the underlying base is an A, C, G or T. In order to accommodate the 4 different scores and obtain a sensible dynamic range, we have modified the phred formula slightly to generate a new set of quality scores that approaches the phred definition asymptotically for Q > 10-15 (Alignment Scoring Guide and FAQ). These scores have been shown to
Unfortunately, the calibration may be far from perfect, particularly for high quality scores Q >> 10. At the moment their dynamic range is also limited to Q=30; this is of course an arbitrary cut-off (motivated by 1. the rising CPU time it would take in the current implementation to go beyond Q30 and 2. the fact that the quality estimation procedure used by the base-caller is unlikely to be accurate beyond Q30 anyway). The aim of the recalibration is to improve the correlation of the scores with the alignment error rates and ideally also to extend their range beyond Q30.

Usage

The calibration is part of the Solexa pipeline from version 0.2.0 onwards and will be run automatically as part of the genomic analysis. If you want to run the software directly, there are three utilities that you need to calibrate existing quality scores.
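As an overview, the three utilities could be chained as in the following minimal sketch. The program paths and file names are the ones used in the usage examples in the sections that follow; the function names are hypothetical, the optional extractor flags are omitted, and the sketch assumes it is run from a directory in which these relative paths resolve.

```python
import subprocess

QCAL = "Pipeline/QCalibration"  # path as written in this document


def calibration_commands(prefix, qcal=QCAL):
    """Return (argv, stdout_file) pairs for the three calibration steps.

    `prefix` is a lane/tile prefix such as "s_5_0001"; stdout_file is None
    when the program writes its own output file.
    """
    align = prefix + "_align.txt"
    qval = prefix + "_qval.txt"
    qtable = prefix + "_qtable.txt"
    qprb = prefix + "_qprb.txt"
    return [
        # 1. extract quality parameters and training alignments
        ([qcal + "/extractQualityParams.py", align, qval], None),
        # 2. build the quality table from the binned parameters
        ([qcal + "/QualityCalibration", qcal + "/QualityCalibration.xml", qval], qtable),
        # 3. apply the table to obtain the new scores
        ([qcal + "/QualityApply", qtable, qval], qprb),
    ]


def run_calibration(prefix, qcal=QCAL):
    """Run the three steps in order, redirecting stdout where needed."""
    for argv, out_path in calibration_commands(prefix, qcal):
        if out_path is None:
            subprocess.check_call(argv)
        else:
            with open(out_path, "w") as out:
                subprocess.check_call(argv, stdout=out)


if __name__ == "__main__":
    # Print the commands instead of running them.
    for argv, out_path in calibration_commands("s_5_0001"):
        print(" ".join(argv) + (" > " + out_path if out_path else ""))
```

Keeping the command construction separate from the execution makes it possible to inspect the commands for a lane before anything is run.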
Extraction of parameters

The first step is a simple Python script to extract the relevant parameters (and training alignments) from the run-folder. At the moment, this script is used for self-calibration and requires both the parameters and the alignments to be present. Of course, at some stage it may be desirable to use the parameters only and apply a table derived from a different data set. The parameters that are currently in use are inspired by work done by Pablo Alvarez and Will Brockman:
The usage of the quality extractor is:

    Pipeline/QCalibration/extractQualityParams.py s_5_0001_align.txt [--USE_BASES=nYYYYYYYYYY] \
        [--PURE_BASES=12] [--sig-suffix=sig2] [--dir=<bustardfolder>] s_5_0001_qval.txt

Instead of align files, you could also use prealign or qalign files (see What do the different files in an analysis directory mean). The alignment file has to be in a proper run-folder as the script will look for prb and sig files in a directory above the current one unless you specify the "--dir" option. The USE_BASES and PURE_BASES options need to follow the same format used for the alignments (see GERALD User Guide and FAQ).

The qval file has the following format: Each row stands for one base. Separated by tabs there are columns for each quality parameter. The last two columns contain the called base, and the corresponding base from the alignment. If no alignment is available, a '.' is used.

Creation of a quality table

    Pipeline/QCalibration/QualityCalibration Pipeline/QCalibration/QualityCalibration.xml s_5_0001_qval.txt > s_5_0001_qtable.txt

You can specify an arbitrary number of qval files. The qval file could have more than 4 parameters. The calibration routines are agnostic as to the number of parameters, so it would be easy to use a different set of parameters with a different extractor script. (The only restriction is that QualityApply assumes the first parameter is the cycle number.)

The binning for these parameters must be configured in an XML file such as QualityCalibration.xml - the path of this file is the first argument expected by QualityCalibration. (As yet, no code has been written to auto-generate parameter binning.) Two schemata are supported for the specification of parameter bin boundaries. One allows arbitrary binnings to be supplied explicitly as a list, e.g.

    <Parameter>
        <Name>Predictor</Name>
        <BinBoundaries>-0.7 0.3 4.2 20.5 101.0</BinBoundaries>
    </Parameter>

The other allows concise specification of a binning in which all bins except the end ones are of equal width, e.g.

    <Parameter>
        <Name>Purity Score Per Base</Name>
        <FirstBinLowBound>-0.025</FirstBinLowBound>
        <FirstBinHighBound>0.525</FirstBinHighBound>
        <LastBinLowBound>0.975</LastBinLowBound>
        <LastBinHighBound>1.025</LastBinHighBound>
        <NumBins>11</NumBins>
    </Parameter>

Generating the new scores

    Pipeline/QCalibration/QualityApply s_5_0001_qtable.txt s_5_0001_qval.txt > s_5_0001_qprb.txt

Observations
TODO
Acknowledgments

Many thanks go to Pablo Alvarez and Will Brockman, who pioneered the application of the phred scheme to Solexa data and suggested the parameters we are currently using.
Document generated by Confluence on Dec 19, 2007 18:32