Data analysis - documentation : Base-call calibration
This page last changed on May 01, 2008 by rshaw.
The analysis pipeline contains a module for quality score calibration. This is essentially a reimplementation of the phred paper (Ewing and Green, Base-Calling of Automated Sequencer Traces Using Phred. II. Error Probabilities, 1998, Genome Research, 8, 186). The calibration framework uses a set of trace- and base-specific parameters that are indicative of the base-call quality to predict a new calibrated quality score. The mapping between old and new scores is encoded in a ("phred") table, which is calibrated on a set of known alignments.

The pipeline performs the calibration procedure as a post-processing step to the base-calling. Each lane can be calibrated separately. The alignment obtained from the lane itself can be used as a training set, resulting in a procedure we refer to as auto-calibration (i.e. training set and target data are the same). In addition, a table derived from a different data set (or a control lane) can be applied (cross-calibration).

The main goal of the quality calibration is to bring the quality scores (and corresponding predicted error rates) that the base-caller generates into line with the error rates obtained from an alignment. The original (Bustard) base-caller scores are essentially based on an error propagation of the estimated noise on the underlying raw cluster intensities; the methodology is described in more detail elsewhere. The base-caller generates 4 quality scores rather than just one, giving the relative probabilities that the underlying base is an A, C, G or T. The scores are captured in a file format called "_prb.txt" (see Run-Folder specification). In order to accommodate the 4 different reads and obtain a sensible dynamic range, we have modified the phred formula slightly to generate a new set of quality scores that approaches the phred definition asymptotically for Q > 10-15 (Alignment Scoring Guide and FAQ). These raw scores produced by the base-caller have been shown to
Unfortunately, the calibration may be far from perfect, particularly for high quality scores Q >> 10. At the moment the dynamic range of the scores is artificially limited (currently to Q40). This is an arbitrary cut-off motivated by the fact that the quality estimation procedure used by the base-caller is unlikely to be accurate beyond the cut-off. The aim of the recalibration is to improve the correlation of the scores with the error rates obtained from alignment against a known reference.

The auto-calibration procedure that can be applied to improve the quality calibration rests on the assumption that the alignments are more or less correct; violations of that assumption will skew the calibration and limit its accuracy. For example, if the sample in question contains a contaminant (e.g. E. coli sequence) at the 5% level, the corresponding reads may - depending on read-length and target genome - be mistaken for aligned reads and contribute significantly to the error rates. A reference obtained for a different individual would limit the maximum quality to the rate at which SNPs are observed - the quality scores cannot get better than the SNP rate. The maximum quality score is also limited by the size of the data set. For example, in order to obtain Q40, one would need around 10^4 base-pairs at a quality of Q40 (presumably even more because of the Poisson counting error). For more information on alignment related issues, see Alignment Scoring Guide and FAQ.

Usage

The calibration is part of the Solexa pipeline from version 0.2.0 onwards and will be run automatically as part of the genomic analysis. In 0.4 the calibration has been rewritten to allow calibration from an external table or a control lane.
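As a worked illustration of the score definitions and of the data-set size limit discussed in the introduction, the following minimal sketch relates quality scores to error probabilities. It assumes the standard phred definition Q = -10 log10(p). The log-odds form used for the modified score is an assumption made for illustration only, since this page does not state the modified formula. All function names are hypothetical and are not part of the Pipeline.

```python
import math


def phred_quality(p_error):
    """Standard phred score: Q = -10 * log10(p)."""
    return -10.0 * math.log10(p_error)


def log_odds_quality(p_error):
    """Assumed form of the modified score: Q = -10 * log10(p / (1 - p)).

    This is an illustrative assumption; the page only says that the
    modified score approaches the phred definition for larger Q.
    """
    return -10.0 * math.log10(p_error / (1.0 - p_error))


def error_probability(q):
    """Invert the phred definition: p = 10 ** (-Q / 10)."""
    return 10.0 ** (-q / 10.0)


def bases_per_expected_error(q):
    """Number of bases at quality Q over which one error is expected."""
    return 1.0 / error_probability(q)


# The two scores differ for poor base-calls and converge as p shrinks.
for p in (0.5, 0.1, 0.01, 0.001):
    print(p, round(phred_quality(p), 2), round(log_odds_quality(p), 2))

# At Q40 one error is expected per roughly 10^4 bases, which is why a
# training set needs at least that many Q40 bases to confirm such a score.
print(round(bases_per_expected_error(40)))
```

Under these assumptions the two definitions agree to within a small fraction of a unit once Q exceeds about 15, and observing even a single error at Q40 requires on the order of 10^4 aligned bases.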
Extraction of quality predictors (qval files)

A simple example of base-call quality predictor extraction is :-

    Pipeline/QCalibration/extract_quality_predictors s_2_0059_sig2.txt s_2_0059_prb.txt s_2_0059_seq.txt s_2_0059_qval.txt

Input data is read from the sig2 (per-channel intensities), prb (uncalibrated probabilities of each possible base type for each base) and seq (called bases) files and the predictor values are written to the qval file. (Note that either or both of the relatively large sig2 and qval files may be compressed and thus probably have an additional suffix; see detailed usage below. The QVAL_COMPR definition in the GERALD Makefile specifies which type of compression, if any, should be used.)

The current default predictors are inspired by work done by Pablo Alvarez and Will Brockman:
The choice of predictors is configurable (see below).

The format of the (uncompressed) qval.txt file is as follows: Each row corresponds to one base. The columns are tab-separated and each contains the values for one of the quality predictors - with the exception of the last column, which contains the called base types.

The above example is for a single read analysis. For a paired read analysis, the --multi_read option must be specified and the --orig_read_lengths option must not only be specified but also have as its value the colon-separated lengths of the two reads, e.g.

    --orig_read_lengths 36:36

In addition, two qval files must be specified - one each for the predictor values associated with the two reads.

The detailed usage of the extract_quality_predictors binary is as follows :-

    Info: Usage : ./extract_quality_predictors -h|--help
          OR    : ./extract_quality_predictors
                      [-l|--orig_read_lengths <ORIG_READ_LENGTHS> [-m|--multi_read]]
                      [-p|--pure_bases <PURE_BASES>]
                      [-z|--sig2_compression <gzip|bzip2>]
                      [-Z|--qval_compression <gzip|bzip2>]
                      [-x|--predictors <PREDICTOR_LIST>]
                      <sig2_file> <prb_file> <seq_file>
                      <qval_file | read1_qval_file read2_qval_file ...>

    where :-

    ORIG_READ_LENGTHS is a colon-separated list
    PURE_BASES is the number of cycles over which max_early_unchastity is calculated (default 12)
    PREDICTOR_LIST is a colon-separated list drawn from the following valid predictor names :-

        homopol_len
        in_read_cycle
        max_early_unchastity
        max_local_unchastity
        raw_unquality
        signal_decay
        unchastity

    The default list is :-

        in_read_cycle:unchastity:max_early_unchastity:raw_unquality

The naming of the predictors reflects the fact that the prediction algorithm expects higher predictor values to correspond to lower base-call quality, e.g. `unchastity' is simply `1 - chastity'. See (Base-call calibration - experimental predictors) for descriptions of the currently implemented experimental predictors.
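The qval layout described above (one tab-separated row per base, predictor columns followed by the called base) can be read with a few lines of code. The following is a minimal sketch, not part of the Pipeline: the function name is hypothetical, the columns are assumed to appear in the same order as the predictor list given to extract_quality_predictors, and the sample rows are made up for illustration.

```python
import io

# Default predictor list quoted in the usage text above.
DEFAULT_PREDICTORS = "in_read_cycle:unchastity:max_early_unchastity:raw_unquality".split(":")


def parse_qval(handle, predictor_names):
    """Yield (predictors, called_base) for each row of an uncompressed qval file.

    Assumes one numeric column per predictor, in the order of
    predictor_names, followed by a final column holding the called base.
    """
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        *values, base = line.split("\t")
        if len(values) != len(predictor_names):
            raise ValueError(
                "expected %d predictor columns, got %d"
                % (len(predictor_names), len(values))
            )
        yield dict(zip(predictor_names, map(float, values))), base


# Made-up rows for illustration; not taken from a real run.
example = io.StringIO("1\t0.02\t0.10\t3.5\tA\n2\t0.40\t0.10\t12.0\tC\n")
rows = list(parse_qval(example, DEFAULT_PREDICTORS))
for predictors, base in rows:
    print(base, predictors)
```

Passing the predictor names explicitly keeps the sketch usable with a non-default --predictors list.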
Extraction of reference bases (qref files)

In addition to the information contained in the qval files, the calibration process also requires the reference value for each base for which it is available; this is the case when the base is within a read that has been uniquely aligned. (In versions of the Pipeline before 1.0, these reference base values were included in the qval files; the current separation reflects the stages within the Pipeline at which the two types of information become available.)

The reference bases are derived from the combination of seqpre files with saf files (or from the corresponding align files that are produced by older analysis modes) using a Perl script. The usage of the saf2qref.pl script is as follows :-

    Usage: paste seqpreFile safFile | ./saf2qref.pl [--read1 | --read2] \
               --use_bases USE_BASES --orig_read_lengths ORIG_READ_LENGTHS \
               qref_prefix qref_suffix > qrefFile

For paired read analysis, either --read1 or --read2 must be specified as appropriate.

The USE_BASES string should correspond in length to the total number (across all reads) of cycles passed through from the Bustard analysis stage and contain only :-
The ORIG_READ_LENGTHS string should comprise a colon-separated list of the original lengths of reads passed through from the Bustard analysis stage, e.g. for paired reads, this might be `36:36'. The sum of these values should equal the length of the USE_BASES string.

The qref_prefix and qref_suffix options are required so that saf2qref.pl can generate a qref file for each tile represented in the per-lane saf file. The qref_prefix should normally be the lane prefix plus a trailing underscore, e.g. `s_1_' for lane 1. The qref_suffix will normally be `_qref.txt' for single read analysis and _<READ_NUM>_qref.txt, e.g. `_1_qref.txt' for read 1, for paired read analysis.

A complete example would thus be :-

    paste s_1_1_seqpre.txt s_1_1_saf.txt | Pipeline/Gerald/saf2qref.pl --read1 \
        --use_bases YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYnyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyn \
        --orig_read_lengths 36:36 s_1_ _1_qref.txt

The usage of the align2qref.pl script for older analysis modes is somewhat simpler, as none of those modes support paired reads and as align files are per tile (note that align2qref.pl thus writes all the reference bases it extracts to standard out, unlike saf2qref.pl which generates an output file per tile) :-

    Usage: cat alignFile | ./align2qref.pl --use_bases USE_BASES > qrefFile

The format of the qref file produced by either means is the same - a single column of reference bases. Within sets of bases corresponding to an aligned read, any bases corresponding to cycles masked out by USE_BASES will be represented by a dot (period), `.'. The bases corresponding to reads not uniquely aligned will all be represented in the same manner - as will any unknown bases in the reference sequence to which reads have been uniquely aligned.
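To illustrate how the qref column is meant to be consumed, the following minimal sketch counts base-call errors against the reference while skipping positions marked with a dot. It is not Pipeline code: the function name is hypothetical, and it assumes that row i of the qref file corresponds to row i of the matching qval file and that bases can be compared case-insensitively.

```python
def count_errors(called_bases, reference_bases):
    """Return (n_compared, n_errors) for two equally long base sequences.

    Positions whose reference entry is '.' (masked cycle, read not uniquely
    aligned, or unknown reference base) carry no information and are skipped.
    """
    if len(called_bases) != len(reference_bases):
        raise ValueError("called and reference sequences differ in length")
    compared = 0
    errors = 0
    for called, ref in zip(called_bases, reference_bases):
        if ref == ".":
            continue
        compared += 1
        if called.upper() != ref.upper():
            errors += 1
    return compared, errors


# Made-up data: one masked position and one mismatch.
compared, errors = count_errors("ACGTAC", "AC.TGC")
print(compared, errors)
```

Only the compared positions contribute to an observed error rate, which is why reads that are not uniquely aligned add nothing to the training set.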
Creation of a quality table (qtable file)

A simple example of quality table generation is :-

    Pipeline/QCalibration/QualityCalibration --cfg Pipeline/QCalibration/QualityCalibration.xml \
        s_5_0001_qval.txt s_5_0002_qval.txt > s_5_qtable.txt

You can specify an arbitrary number of qval files. The qval file could have more than 4 predictor parameters. The calibration routines are agnostic as to the number of parameters, so it would be easy to use a different set of parameters with a different extractor script. The binning for these parameters must be configured in an XML file such as QualityCalibration.xml - the path of this file is supplied to QualityCalibration via the --cfg option. (As yet, no code has been written to auto-generate parameter binning.)

Two schemata are supported for the specification of parameter bin boundaries. One allows arbitrary binnings to be supplied explicitly as a list, e.g. :-

    <Parameter>
      <Name>Predictor</Name>
      <BinBoundaries>-0.7 0.3 4.2 20.5 101.0</BinBoundaries>
    </Parameter>

The other allows concise specification of a binning in which all bins except the end ones are of equal width, e.g. :-

    <Parameter>
      <Name>Purity Score Per Base</Name>
      <FirstBinLowBound>-0.025</FirstBinLowBound>
      <FirstBinHighBound>0.525</FirstBinHighBound>
      <LastBinLowBound>0.975</LastBinLowBound>
      <LastBinHighBound>1.025</LastBinHighBound>
      <NumBins>11</NumBins>
    </Parameter>

QualityCalibration derives the name of a corresponding qref file (containing the reference bases that it needs) from each qval file that is specified. It does this by replacing the qval filename suffix (by default `_qval.txt') with the qref filename suffix (by default `_qref.txt'). These suffix strings can be overridden if required.
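The concise binning schema can be expanded into an explicit boundary list of the kind used by the first schema. The sketch below shows one plausible reading and is not the Pipeline's implementation: it assumes that NumBins counts the two end bins, that the interior bins evenly divide the interval between FirstBinHighBound and LastBinLowBound, and that bins are half-open on the right. Function names are hypothetical.

```python
import bisect


def equal_width_boundaries(first_low, first_high, last_low, last_high, num_bins):
    """Expand the concise schema into an explicit list of bin boundaries.

    Assumes num_bins includes the first and last bins, so that
    num_bins - 2 equal-width bins lie between first_high and last_low.
    """
    if num_bins < 2:
        raise ValueError("need at least the first and the last bin")
    interior = num_bins - 2
    boundaries = [first_low, first_high]
    if interior > 0:
        width = (last_low - first_high) / interior
        for i in range(1, interior):
            boundaries.append(first_high + i * width)
        boundaries.append(last_low)
    elif first_high != last_low:
        raise ValueError("with two bins the end bins must be adjacent")
    boundaries.append(last_high)
    return boundaries


def bin_index(boundaries, value):
    """Return the index of the bin holding value, or None if out of range.

    Bins are treated as half-open intervals [low, high); this is an assumption.
    """
    if value < boundaries[0] or value >= boundaries[-1]:
        return None
    return bisect.bisect_right(boundaries, value) - 1


# Values from the "Purity Score Per Base" example above.
bounds = equal_width_boundaries(-0.025, 0.525, 0.975, 1.025, 11)
print(len(bounds) - 1, "bins")
print(bin_index(bounds, 1.0), bin_index(bounds, 0.55))
```

With the example values this reading gives nine interior bins of width 0.05, which suggests the wide first bin is meant to collect all low purity scores in a single cell.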
The qval files can either be listed as explicit arguments to QualityCalibration or can be derived from the contents of a tile list file, by default `tiles.txt'. In the latter case, the qval file names are generated from those tile names in the tile list file that match a specified prefix, by appending the default or specified qval file suffix. In standard Pipeline practice, the prefix corresponds to a lane (e.g. `s_1' for lane 1) and the output is written to one quality table for that lane for single read analysis. For paired read analysis, two per-read quality tables are produced by separate applications of QualityCalibration to each per-read set of qval files, with specification of the latter by suffix strings that include the read number, e.g. `_1_qval.txt' could be used for read 1 and `_2_qval.txt' for read 2.

QualityCalibration can read from compressed qval (and/or qref) files if this is specified. Note that the corresponding suffixes will then have to be specified in full as QualityCalibration does not automatically append compression-related extensions.

The detailed usage of the QualityCalibration binary is as follows :-

    ./QualityCalibration --help
    Info: Usage : ./QualityCalibration -h|--help
          OR    : ./QualityCalibration
                      [-S|--qval_suffix _qval.txt]
                      [-s|--qref_suffix _qref.txt]
                      [-Z|--qval_compression none|gzip|bzip2]
                      [-z|--qref_compression none|gzip|bzip2]
                      -c|--cfg <cfg_file.xml>
                      < <qvalue-file1> [<qvalue-file2> ...] |
                        -T|--tile_list_file <tiles.txt> -t|--tile_prefix <lane_prefix e.g. s_3> >

The qval_suffix and qref_suffix options (with default values as shown above) are used to generate qref filenames from the specified qval filenames. In addition, however, if the qval files are specified by means of a tile list file plus lane prefix filter, the qval_suffix is appended to the tile names to generate the qval filenames. In this case, non-default suffix values may be required, e.g. _1_qval.txt and _1_qref.txt for read 1 in eland pair analysis.
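The file name derivation described above can be sketched as follows. This is an illustration rather than the binary's actual logic: the function names are hypothetical, tile names are assumed to look like `s_1_0001' (by analogy with the file names used in the examples on this page), and prefix matching is assumed to be a plain string-prefix test.

```python
def qval_names_from_tiles(tile_names, tile_prefix, qval_suffix="_qval.txt"):
    """Build qval file names for the tiles whose names start with tile_prefix."""
    return [name + qval_suffix for name in tile_names if name.startswith(tile_prefix)]


def qref_name(qval_name, qval_suffix="_qval.txt", qref_suffix="_qref.txt"):
    """Derive the qref file name by replacing the qval suffix with the qref suffix."""
    if not qval_name.endswith(qval_suffix):
        raise ValueError("%r does not end with %r" % (qval_name, qval_suffix))
    return qval_name[: -len(qval_suffix)] + qref_suffix


# Hypothetical tile list covering two lanes; read 1 of a paired read analysis.
tiles = ["s_1_0001", "s_1_0002", "s_2_0001"]
qvals = qval_names_from_tiles(tiles, "s_1", "_1_qval.txt")
qrefs = [qref_name(name, "_1_qval.txt", "_1_qref.txt") for name in qvals]
print(qvals)
print(qrefs)
```

Because compression extensions are not appended automatically, a compressed input would need the full suffix passed in for both arguments, following the note above.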
Generating the new quality values (qcal files)

A simple example of reestimation of quality values is :-

    Pipeline/QCalibration/QualityApply --orig_read_lengths 36:36 s_5_qtable.txt s_5_0001_qval.txt \
        > s_5_0001_qcal.txt

This shows the reestimation of the quality values for base-calls within the reads in one tile, based upon a quality table calculated for the whole lane. In standard Pipeline practice, however, QualityApply is run once per lane (or per read per lane) to produce a per-lane (or per-read, per-lane) qcal file from all the corresponding qval files. As with QualityCalibration, multiple qval files may be specified by means of a tile list file as an alternative to listing them explicitly. (Unlike QualityCalibration, however, QualityApply does not use qref files.)

The default output format for the reestimated quality values is symbolic. Specifically, the value is represented by the ASCII character for which the code is the value added to 64 - thus allowing a range of negative as well as positive values to be compactly displayed. Each output row is a single string of such characters, corresponding to the bases of a single read. Numeric output may, however, be specified instead; each output row again corresponds to a single read but consists of space-separated numbers.

The detailed usage of the QualityApply binary is as follows :-

    Info: Usage : ./QualityApply -h|--help
          OR    : ./QualityApply
                      -l|--orig_read_lengths <ORIG_READ_LENGTHS>
                      [-r|--read READ_NUM]
                      [-Z|--qval_compression none|gzip|bzip2]
                      [-n|--numeric | -s|--symbolic]
                      <qtable-file>
                      < <qvalue-file1> [<qvalue-file2> ...] |
                        -T|--tile_list_file <tiles.txt> -t|--tile_prefix <lane_prefix e.g. s_3>
                        [-S|--qval_suffix _qval.txt] >

If the qval files are specified by means of a tile list file plus lane prefix filter, the qval_suffix is appended to the tile names to generate the qval filenames. A non-default suffix value may be required for paired read analysis, e.g. _1_qval.txt for read 1 and _2_qval.txt for read 2.
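The symbolic and numeric output formats described above can be converted into one another with very little code. The following minimal sketch implements the stated rule (character code = quality value + 64); the function names are hypothetical and the quality values are made up for illustration.

```python
QUALITY_OFFSET = 64  # offset stated in the description of the symbolic format


def encode_symbolic(qualities):
    """Encode integer quality values as one character per base."""
    return "".join(chr(q + QUALITY_OFFSET) for q in qualities)


def decode_symbolic(text):
    """Recover integer quality values from a symbolic qcal row."""
    return [ord(ch) - QUALITY_OFFSET for ch in text]


def encode_numeric(qualities):
    """Format quality values as a space-separated numeric row."""
    return " ".join(str(q) for q in qualities)


# Made-up values for a three-base read, including a negative score.
read_qualities = [40, 0, -5]
symbolic = encode_symbolic(read_qualities)
print(symbolic)
print(decode_symbolic(symbolic))
print(encode_numeric(read_qualities))
```

An offset of 64 keeps moderately negative values within the printable ASCII range, which fits the stated aim of displaying negative as well as positive scores compactly.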
Configuring Quality Table Sources in GERALD

By default, the Pipeline generates a quality table for each lane (or for each read in the lane) in which an analysis including alignment is performed and then uses this quality table (or pair of quality tables) to reestimate the base-call quality values of all the tiles in that lane. The source of the quality table(s) used in the quality calibration for a lane may, however, be overridden by defining QCAL_SOURCE (or, for individual reads within a paired read analysis, QCAL_SOURCE1 and/or QCAL_SOURCE2) in the config.txt passed to GERALD.pl to configure GERALD analysis. The supported values of the QCAL_SOURCE variables are :-
As with any config.txt variable, the QCAL_SOURCE variables may be specified on a flowcell-wide basis or for any non-overlapping subsets of the flowcell lanes, with the latter overriding the former if both are specified. In addition, in a paired read analysis lane, specification of QCAL_SOURCE1 and/or QCAL_SOURCE2 will override specification of QCAL_SOURCE, although the latter will be used if it has been specified and not overridden for a given read. The interpretations for cases where paired read analysis is in use for either or both of the source lane and the target lane are intended to be based upon the principle of least surprise :-
As an artificial example :-

    ANALYSIS eland_pair
    QCAL_SOURCE /home/illumina/ref42_qtable.txt
    123:QCAL_SOURCE auto8
    4:QCAL_SOURCE1 /home/illumina/ref51_qtable.txt
    4:QCAL_SOURCE2 auto7

specifies that :-
Note that even though the lane 8 qtables will not be needed in lane 8, they will still be generated for use in lanes 1-3.

Observations
TODO
Acknowledgments

Many thanks go to Pablo Alvarez and Will Brockman, who pioneered the application of the phred scheme to Solexa data and suggested the parameters we are currently using.
Document generated by Confluence on Jul 25, 2008 16:42