This page last changed on May 01, 2008 by rshaw.

The analysis pipeline contains a module for quality score calibration. This is essentially a reimplementation of the method described in the phred paper (Ewing and Green, Base-Calling of Automated Sequencer Traces Using Phred. II. Error Probabilities, 1998, Genome Research, 8, 186). The calibration framework uses a set of trace- and base-specific parameters that are indicative of base-call quality to predict a new calibrated quality score. The mapping between old and new scores is encoded in a ("phred") table and is calibrated on a set of known alignments. The pipeline performs the calibration procedure as a post-processing step to the base-calling. Each lane can be calibrated separately. The alignment obtained from the lane itself can be used as a training set, resulting in a procedure we refer to as auto-calibration (i.e. training set and target data are the same). In addition, a table derived from a different data set (or a control lane) can be applied (cross-calibration).

The main goal of the quality calibration is to bring the quality scores (and corresponding predicted error rates) that the base-caller generates into line with the error rates obtained from an alignment. The original (Bustard) base-caller scores are essentially based on an error propagation of the estimated noise on the underlying raw cluster intensities; the methodology is described in more detail elsewhere. The base-caller generates 4 quality scores per base rather than just one, giving the relative probabilities that the underlying base is an A, C, G or T. The scores are captured in a file format called "_prb.txt" (see Run-Folder specification). In order to accommodate the 4 different base types and obtain a sensible dynamic range, we have modified the phred formula slightly to generate a new set of quality scores that approaches the phred definition asymptotically for Q > 10-15 (Alignment Scoring Guide and FAQ).

These raw scores produced by the base-caller have been shown to

  1. be monotonic predictors of base-quality out to nominal Q scores of at least 40,
  2. provide additional information even for the non-called base.

Unfortunately, the calibration may be far from perfect, particularly for high quality scores Q >> 10. At the moment their dynamic range is artificially limited (currently to Q40). This is an arbitrary cut-off motivated by the fact that the quality estimation procedure used by the base-caller is unlikely to be accurate beyond the cut-off. The aim of the recalibration is to improve the correlation of the scores with the error rates obtained from alignment against a known reference.
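
To illustrate how a modified score can approach the phred definition for Q > 10-15, the following sketch compares the standard phred score, Q = -10 log10(p), with a log-odds variant, Q = -10 log10(p / (1 - p)), where p is the predicted error probability. The log-odds form is an assumption made for illustration (the exact modified formula is given in the Alignment Scoring Guide and FAQ, not on this page), and the function names are hypothetical.

```python
import math

def phred_score(p_error):
    """Standard phred definition: Q = -10 * log10(p)."""
    return -10.0 * math.log10(p_error)

def log_odds_score(p_error):
    """Assumed modified form: Q = -10 * log10(p / (1 - p))."""
    return -10.0 * math.log10(p_error / (1.0 - p_error))

# Compare the two definitions over a range of error probabilities.
for p in (0.5, 0.1, 0.03, 0.01, 0.001):
    print("p=%.3f  phred=%6.2f  log-odds=%6.2f"
          % (p, phred_score(p), log_odds_score(p)))
```

Under this assumption the two scores differ by -10 log10(1 - p): about 3 at p = 0.5 (where the log-odds score is 0), less than 0.5 at Q10 and less than 0.05 at Q20.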

The auto-calibration procedure that can be applied to improve the quality calibration is based on the assumption that the alignments are more or less correct; violations of that assumption will skew the calibration and limit its accuracy. For example, if the sample in question contains a contaminant (e.g. E. coli sequence) at the 5% level, the corresponding reads may - depending on read-length and target genome - be mistaken for aligned reads and contribute significantly to the error rates. A reference obtained for a different individual would limit the maximum quality to the rate at which SNPs are observed - the quality scores cannot get better than the SNP rate. The maximum quality score is also limited by the size of the data set. For example, in order to obtain Q40, one would need around 10^4 base-pairs at a quality of Q40 (presumably even more because of the Poisson counting error). For more information on alignment related issues, see Alignment Scoring Guide and FAQ.
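
The data-set size limit can be made concrete with a short calculation. This is a sketch only, assuming the standard phred relation Q = -10 log10(p) and a Poisson-distributed error count; the function names and the expected_errors parameter are illustrative and not part of the pipeline.

```python
import math

def bases_needed(q, expected_errors=1):
    """Bases at quality q needed to expect `expected_errors` errors."""
    return math.ceil(expected_errors * 10.0 ** (q / 10.0))

def relative_counting_error(expected_errors):
    """Relative Poisson uncertainty, 1/sqrt(k), on an error count of k."""
    return 1.0 / math.sqrt(expected_errors)

print(bases_needed(40))                       # one expected error at Q40
print(bases_needed(40, expected_errors=100))  # 100 expected errors at Q40
print(relative_counting_error(100))           # relative uncertainty for 100 errors
```

With a single expected error the relative counting uncertainty is 100%, which is why considerably more than 10^4 bases are needed in practice to support a Q40 estimate.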

Usage

The calibration is part of the Solexa pipeline from version 0.2.0 onwards and will be run automatically as part of the genomic analysis. In 0.4 the calibration has been rewritten to allow calibration from an external table or a control lane.

Extraction of quality predictors (qval files)

A simple example of base-call quality predictor extraction is :-

Pipeline/QCalibration/extract_quality_predictors s_2_0059_sig2.txt s_2_0059_prb.txt s_2_0059_seq.txt s_2_0059_qval.txt

Input data is read from the sig2 (per-channel intensities), prb (uncalibrated probabilities of each possible base type for each base) and seq (called bases) files and the predictor values are written to the qval file. (Note that either or both of the relatively large sig2 and qval files may be compressed and thus probably have an additional suffix; see detailed usage below. The QVAL_COMPR definition in the GERALD Makefile specifies which type of compression, if any, should be used.)

The current default predictors are inspired by work done by Pablo Alvarez and Will Brockman:

  1. cycle number
  2. a per-base purity score
  3. a minimum purity score over the first PURE_BASES cycles
  4. raw base-caller quality score

The choice of predictors is configurable (see below). The format of the (uncompressed) qval.txt file is as follows: Each row corresponds to one base. The columns are tab-separated and each contains the values for one of the quality predictors - with the exception of the last column, which contains the called base types.
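
As a sketch of how the uncompressed qval format described above might be read, the following parser splits each row into its predictor values and the called base. The function name is hypothetical and the sample rows are invented purely for illustration; they are not real pipeline output.

```python
import io

def parse_qval(handle):
    """Yield (predictor_values, called_base) for each row, i.e. each base."""
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        # All columns but the last are predictors; the last is the called base.
        *values, base = line.split("\t")
        yield [float(v) for v in values], base

# Invented sample: four default predictors followed by the called base.
sample = io.StringIO("1\t0.02\t0.10\t5\tA\n2\t0.00\t0.10\t3\tC\n")
rows = list(parse_qval(sample))
print(rows)
```

Because the number of predictor columns is not fixed in the parser, the same code would handle a qval file written with a non-default predictor list.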

The above example is for a single read analysis. For a paired read analysis, the --multi_read option must be specified and the --orig_read_lengths option must not only be specified but also have as its value the colon-separated lengths of the two reads, e.g. --orig_read_lengths 36:36. In addition, two qval files must be specified - one each for the predictor values associated with the two reads.

The detailed usage of the extract_quality_predictors binary is as follows :-

Info: Usage : ./extract_quality_predictors -h|--help
   OR : ./extract_quality_predictors
[-l|--orig_read_lengths <ORIG_READ_LENGTHS> [-m|--multi_read]]
[-p|--pure_bases <PURE_BASES>]
[-z|--sig2_compression <gzip|bzip2>]
[-Z|--qval_compression <gzip|bzip2>]
[-x|--predictors <PREDICTOR_LIST>]
<sig2_file> <prb_file> <seq_file>
<qval_file | read1_qval_file read2_qval_file ...>

where :-
ORIG_READ_LENGTHS is a colon-separated list
PURE_BASES is the number of cycles over which
  max_early_unchastity is calculated (default 12)
PREDICTOR_LIST is a colon-separated list drawn from the
  following valid predictor names :-
  homopol_len
  in_read_cycle
  max_early_unchastity
  max_local_unchastity
  raw_unquality
  signal_decay
  unchastity

The default list is :-
  in_read_cycle:unchastity:max_early_unchastity:raw_unquality

The naming of the predictors reflects the fact that the prediction algorithm expects higher predictor values to correspond to lower base-call quality, e.g. `unchastity' is simply `1 - chastity'.
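
The following sketch shows one way a PREDICTOR_LIST value could be checked against the valid predictor names listed in the usage text before it is passed to extract_quality_predictors. The helper function is hypothetical; only the predictor names and the default list are taken from the usage output above.

```python
VALID_PREDICTORS = {
    "homopol_len", "in_read_cycle", "max_early_unchastity",
    "max_local_unchastity", "raw_unquality", "signal_decay", "unchastity",
}
DEFAULT_PREDICTORS = "in_read_cycle:unchastity:max_early_unchastity:raw_unquality"

def parse_predictor_list(spec=DEFAULT_PREDICTORS):
    """Split a colon-separated predictor list and reject unknown names."""
    names = spec.split(":")
    unknown = [name for name in names if name not in VALID_PREDICTORS]
    if unknown:
        raise ValueError("unknown predictor(s): " + ", ".join(unknown))
    return names

print(parse_predictor_list())
print(parse_predictor_list("in_read_cycle:homopol_len"))
```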

See (Base-call calibration - experimental predictors) for descriptions of the currently implemented experimental predictors.

Extraction of reference bases (qref files)

In addition to the information contained in the qval files, the calibration process also requires the reference value for each base for which it is available; this is the case when the base is within a read that has been uniquely aligned. (In versions of the Pipeline before 1.0, these reference base values were included in the qval files; the current separation reflects the stages within the Pipeline at which the two types of information become available.)

The reference bases are derived from the combination of seqpre files with saf files (or from the corresponding align files that are produced by older analysis modes) using a Perl script. The usage of the saf2qref.pl script is as follows :-

Usage: paste seqpreFile safFile | ./saf2qref.pl [--read1 | --read2] \
--use_bases USE_BASES --orig_read_lengths ORIG_READ_LENGTHS \
qref_prefix qref_suffix > qrefFile

For paired read analysis, either --read1 or --read2 must be specified as appropriate. The USE_BASES string should correspond in length to the total number (across all reads) of cycles passed through from the Bustard analysis stage and contain only :-

  1. `Y' at the position of a cycle to be included in read 1 analysis,
  2. `y' at the position of a cycle to be included in read 2 analysis or
  3. `n' at the position of any cycle to be masked out of analysis.

Naturally, occurrences of `y' must occur strictly after occurrences of `Y'. In addition, only contiguous occurrences of `Y' and `y' respectively are supported.

The ORIG_READ_LENGTHS string should comprise a colon-separated list of the original lengths of reads passed through from the Bustard analysis stage, e.g. for paired reads, this might be `36:36'. The sum of these values should equal the length of the USE_BASES string.
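
The constraints on USE_BASES and ORIG_READ_LENGTHS stated above can be checked before running saf2qref.pl. The following is a minimal sketch of such a check; the function is hypothetical and is not part of the Pipeline.

```python
import re

def check_use_bases(use_bases, orig_read_lengths):
    """Validate USE_BASES against ORIG_READ_LENGTHS.

    Returns the number of cycles used for read 1 and read 2.
    """
    lengths = [int(n) for n in orig_read_lengths.split(":")]
    if sum(lengths) != len(use_bases):
        raise ValueError("USE_BASES length does not match ORIG_READ_LENGTHS")
    # Only Y, y and n are allowed; one contiguous run of Y, then one of y.
    if not re.fullmatch(r"n*Y*n*y*n*", use_bases):
        raise ValueError("USE_BASES must be a run of Y followed by a run of y")
    return use_bases.count("Y"), use_bases.count("y")

# 35 cycles used from each 36-cycle read, with the last cycle masked out.
mask = "Y" * 35 + "n" + "y" * 35 + "n"
print(check_use_bases(mask, "36:36"))
```

The single regular expression covers the character set, the ordering of `Y' before `y' and the contiguity requirement in one test.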

The qref_prefix and qref_suffix options are required so that saf2qref.pl can generate a qref file for each tile represented in the per-lane saf file. The qref_prefix should normally be the lane prefix plus a trailing underscore, e.g. `s_1_' for lane 1. The qref_suffix will normally be `_qref.txt' for single read analysis and _<READ_NUM>_qref.txt, e.g. `_1_qref.txt' for read 1, for paired read analysis.

A complete example would thus be :-

paste s_1_1_seqpre.txt s_1_1_saf.txt | Pipeline/Gerald/saf2qref.pl --read1 \
--use_bases YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYnyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyn \
--orig_read_lengths 36:36 s_1_ _1_qref.txt

The usage of align2qref.pl for older analysis modes is somewhat simpler, as none of those modes support paired reads and as align files are per tile (note that align2qref.pl thus writes all the reference bases it extracts to standard out, unlike saf2qref.pl which generates an output file per tile) :-

Usage: cat alignFile | ./align2qref.pl --use_bases USE_BASES > qrefFile

The format of the qref file produced by either means is the same - a single column of reference bases. Within sets of bases corresponding to an aligned read, any bases corresponding to cycles masked out by USE_BASES will be represented by a dot (period), `.'. The bases corresponding to reads not uniquely aligned will all be represented in the same manner - as will any unknown bases in the reference sequence to which reads have been uniquely aligned.
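
To show how the qref column can be combined with the called bases, the sketch below counts usable bases and base-call errors while skipping every position marked with a dot. It assumes that the called bases (the last column of the qval file) and the qref rows correspond one-to-one in the same order; the function name and sample data are illustrative only.

```python
def count_errors(called_bases, reference_bases):
    """Return (usable, errors) over positions with a known reference base."""
    usable = errors = 0
    for called, ref in zip(called_bases, reference_bases):
        if ref == ".":
            # Masked cycle, read not uniquely aligned, or unknown reference base.
            continue
        usable += 1
        if called.upper() != ref.upper():
            errors += 1
    return usable, errors

# Invented example: five bases, one without a reference, one mismatch.
print(count_errors("ACGTA", "AC.TT"))
```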

Creation of a quality table (qtable file)

A simple example of quality table generation is :-

Pipeline/QCalibration/QualityCalibration --cfg Pipeline/QCalibration/QualityCalibration.xml \
s_5_0001_qval.txt s_5_0002_qval.txt > s_5_qtable.txt

You can specify an arbitrary number of qval files. The qval file could have more than 4 predictor parameters. The calibration routines are agnostic as to the number of parameters, so it would be easy to use a different set of parameters with a different extractor script. The binning for these parameters must be configured in an XML file such as QualityCalibration.xml - the path of this file is passed to QualityCalibration by means of the --cfg option. (As yet, no code has been written to auto-generate parameter binning.)

Two schemata are supported for the specification of parameter bin boundaries. One allows arbitrary binnings to be supplied explicitly as a list, e.g. :-

<Parameter>
  <Name>Predictor</Name>
  <BinBoundaries>-0.7 0.3 4.2 20.5 101.0</BinBoundaries>
</Parameter>

The other allows concise specification of a binning in which all bins except the end ones are of equal width, e.g. :-

<Parameter>
  <Name>Purity Score Per Base</Name>
  <FirstBinLowBound>-0.025</FirstBinLowBound>
  <FirstBinHighBound>0.525</FirstBinHighBound>
  <LastBinLowBound>0.975</LastBinLowBound>
  <LastBinHighBound>1.025</LastBinHighBound>
  <NumBins>11</NumBins>
</Parameter>
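
The concise schema can be expanded into an explicit list of bin boundaries. The sketch below does this under the assumption that NumBins counts both end bins, so that the remaining NumBins - 2 bins share the interval between FirstBinHighBound and LastBinLowBound equally; that interpretation and the function name are assumptions, not taken from the QualityCalibration source.

```python
def bin_boundaries(first_low, first_high, last_low, last_high, num_bins):
    """Expand the concise binning schema into a sorted list of boundaries."""
    inner = num_bins - 2  # assumed: NumBins includes the two end bins
    if inner < 1:
        raise ValueError("this sketch expects at least one equal-width bin")
    width = (last_low - first_high) / inner
    middle = [first_high + i * width for i in range(inner)]
    return [first_low] + middle + [last_low, last_high]

# Values from the `Purity Score Per Base' example above.
bounds = bin_boundaries(-0.025, 0.525, 0.975, 1.025, 11)
print(len(bounds) - 1, "bins:", [round(b, 3) for b in bounds])
```

For the example parameter this gives 11 bins, with nine middle bins of width 0.05 between 0.525 and 0.975.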

QualityCalibration derives the name of a corresponding qref file (containing the reference bases that it needs) from each qval file that is specified. It does this by replacing the qval filename suffix (by default `_qval.txt') with the qref filename suffix (by default `_qref.txt'). These suffix strings can be overridden if required.

The qval files can either be listed as explicit arguments to QualityCalibration or can be derived from the contents of a tile list file, by default `tiles.txt'. In the latter case, the qval file names are generated from those tile names in the tile list file that match a specified prefix, by appending the default or specified qval file suffix.

In standard Pipeline practice, the prefix corresponds to a lane (e.g. `s_1' for lane 1) and the output is written to one quality table for that lane for single read analysis. For paired read analysis, two per-read quality tables are produced by separate applications of QualityCalibration to each per-read set of qval files, with specification of the latter by suffix strings that include the read number, e.g. `_1_qval.txt' could be used for read 1 and `_2_qval.txt' for read 2.

QualityCalibration can read from compressed qval (and/or qref) files if this is specified. Note that the corresponding suffixes will then have to be specified in full as QualityCalibration does not automatically append compression-related extensions.
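
The filename handling described above can be sketched as follows. Both functions are hypothetical illustrations of the suffix rules rather than the QualityCalibration implementation, and the `.gz' extension used for the compressed case is an assumption.

```python
def qval_names_from_tiles(tile_names, tile_prefix, qval_suffix="_qval.txt"):
    """Build qval filenames for the tiles whose names match the prefix."""
    return [tile + qval_suffix for tile in tile_names
            if tile.startswith(tile_prefix)]

def qref_name(qval_name, qval_suffix="_qval.txt", qref_suffix="_qref.txt"):
    """Derive the qref filename by swapping the qval suffix for the qref one."""
    if not qval_name.endswith(qval_suffix):
        raise ValueError("%s does not end with %s" % (qval_name, qval_suffix))
    return qval_name[:-len(qval_suffix)] + qref_suffix

tiles = ["s_1_0001", "s_1_0002", "s_2_0001"]
qvals = qval_names_from_tiles(tiles, "s_1")
print(qvals)
print([qref_name(name) for name in qvals])
# Compressed read 1 files: the suffixes must be given in full.
print(qref_name("s_1_0001_1_qval.txt.gz", "_1_qval.txt.gz", "_1_qref.txt.gz"))
```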

The detailed usage of the QualityCalibration binary is as follows :-

./QualityCalibration --help
Info: Usage : ./QualityCalibration -h|--help
   OR : ./QualityCalibration
[-S|--qval_suffix _qval.txt]
[-s|--qref_suffix _qref.txt]
[-Z|--qval_compression none|gzip|bzip2]
[-z|--qref_compression none|gzip|bzip2]
-c|--cfg <cfg_file.xml>
< <qvalue-file1> [<qvalue-file2> ...]
  | -T|--tile_list_file <tiles.txt>
    -t|--tile_prefix <lane_prefix e.g. s_3> >

The qval_suffix and qref_suffix options (with default values
as shown above) are used to generate qref filenames from the
specified qval filenames.

In addition, however, if the qval files are specified by means
of a tile list file plus lane prefix filter, the qval_suffix
is appended to the tile names to generate the qval filenames.
In this case, non-default suffix values may be required, e.g.
_1_qval.txt and _1_qref.txt for read 1 in eland pair analysis.

Generating the new quality values (qcal files)

A simple example of reestimation of quality values is :-

Pipeline/QCalibration/QualityApply --orig_read_lengths 36:36 s_5_qtable.txt s_5_0001_qval.txt \
> s_5_0001_qcal.txt

This shows the reestimation of the quality values for base-calls within the reads in one tile, based upon a quality table calculated for the whole lane. In standard Pipeline practice, however, QualityApply is run once per lane (or per read per lane) to produce a per-lane (or per-read, per-lane) qcal file from all the corresponding qval files. As with QualityCalibration, multiple qval files may be specified by means of a tile list file as an alternative to listing them explicitly. (Unlike QualityCalibration, however, QualityApply does not use qref files.)
Once again, the qval files may be supplied and used in compressed form if this is specified; the use of any compression extension will again require the explicit specification of a qval file suffix including it.

The default output format for the reestimated quality values is symbolic. Specifically, the value is represented by the ASCII character for which the code is the value added to 64 - thus allowing a range of negative as well as positive values to be compactly displayed. Each output row is a single string of such characters, corresponding to the bases of a single read.

Numeric output may, however, be specified instead; each output row again corresponds to a single read but consists of space-separated numbers.
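
The two output formats can be illustrated with a short sketch that follows the encoding rule stated above (character code = quality value + 64). The function names are hypothetical and the quality values are invented.

```python
def encode_symbolic(values):
    """Encode one read's quality values as one character per base."""
    return "".join(chr(value + 64) for value in values)

def decode_symbolic(text):
    """Recover the integer quality values from a symbolic qcal row."""
    return [ord(char) - 64 for char in text]

def format_numeric(values):
    """Format one read's quality values as space-separated numbers."""
    return " ".join(str(value) for value in values)

read_qualities = [40, 0, -5]
print(encode_symbolic(read_qualities))
print(decode_symbolic(encode_symbolic(read_qualities)))
print(format_numeric(read_qualities))
```

Here 40 maps to `h' (code 104), 0 to `@' (code 64) and -5 to `;' (code 59), which shows how the offset of 64 keeps moderately negative values within the printable range.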

The detailed usage of the QualityApply binary is as follows :-

Info: Usage : ./QualityApply -h|--help
   OR : ./QualityApply
-l|--orig_read_lengths <ORIG_READ_LENGTHS>
[-r|--read READ_NUM]
[-Z|--qval_compression none|gzip|bzip2]
[-n|--numeric | -s|--symbolic]
<qtable-file>
< <qvalue-file1> [<qvalue-file2> ...]
  | -T|--tile_list_file <tiles.txt>
    -t|--tile_prefix <lane_prefix e.g. s_3>
   [-S|--qval_suffix _qval.txt] >

If the qval files are specified by means of a tile list file
plus lane prefix filter, the qval_suffix is appended to the
tile names to generate the qval filenames. A non-default
suffix value may be required for paired read analysis, e.g.
_1_qval.txt for read 1 and _2_qval.txt for read 2.

Configuring Quality Table Sources in GERALD

By default, the Pipeline generates a quality table for each lane (or for each read in the lane) in which an analysis including alignment is performed and then uses this quality table (or pair of quality tables) to reestimate the base-call quality values of all the tiles in that lane.

The source of the quality table(s) used in the quality calibration for a lane may, however, be overridden by defining QCAL_SOURCE (or, for individual reads within a paired read analysis, QCAL_SOURCE1 and/or QCAL_SOURCE2) in the config.txt passed to GERALD.pl to configure GERALD analysis.

The supported values of the QCAL_SOURCE variables are :-

  1. auto : the qtable(s) used within the lane (to reestimate the base-call quality values) will be the qtable(s) generated for that lane (from the quality predictor values and called and reference base values of bases in reads from that lane)
  2. auto<n>, where n is the number of a lane for which alignment will be performed : the qtable(s) used in the lane will be those generated for lane n, e.g. `auto5' means that the qtable(s) from lane 5 will be used
  3. /path/to/qtable.txt : the qtable file at the specified path will be used

As with any config.txt variable, the QCAL_SOURCE variables may be specified on a flowcell-wide basis or for any non-overlapping subsets of the flowcell lanes, with the latter overriding the former if both are specified.

In addition, in a paired read analysis lane, specification of QCAL_SOURCE1 and/or QCAL_SOURCE2 will override specification of QCAL_SOURCE, although the latter will be used if it has been specified and not overridden for a given read.

The interpretations for cases where paired read analysis is in use for either or both of the source lane and the target lane are intended to be based upon the principle of least surprise :-

  1. If both lanes are paired, then any specification of the source lane for read 1 of the target lane results in the read 1 qtable of the source lane being used as the read 1 qtable in the target lane - and similarly for read 2.
  2. If only the target lane is paired, then there is only one qtable available in the source lane but it may be used for both reads in the target lane.
  3. If only the source lane is paired, then its read 1 qtable is (arbitrarily) used.

As an artificial example :-

ANALYSIS eland_pair
QCAL_SOURCE /home/illumina/ref42_qtable.txt
123:QCAL_SOURCE auto8
4:QCAL_SOURCE1 /home/illumina/ref51_qtable.txt
4:QCAL_SOURCE2 auto7

specifies that :-

  1. lanes 1-3 read 1 will use the lane 8 read 1 qtable; lanes 1-3 read 2 will use the lane 8 read 2 qtable
  2. lane 4 read 1 will use the external ref51_qtable.txt qtable
  3. lane 4 read 2 will use the lane 7 read 2 qtable
  4. lanes 5-8 reads 1 and 2 will use the ref42_qtable.txt qtable

Note that even though the lane 8 qtables will not be needed in lane 8, they will still be generated for use in lanes 1-3.
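
The precedence rules above can be sketched as a small resolver applied to the example configuration. This is not GERALD code: the parsing is simplified, the function names are hypothetical, and it assumes that a per-read variable (QCAL_SOURCE1 or QCAL_SOURCE2) takes precedence over QCAL_SOURCE regardless of whether either was given flowcell-wide or per lane.

```python
def parse_config(lines):
    """Return {variable: {lane or None: value}}; None means flowcell-wide."""
    settings = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        key, value = line.split(None, 1)
        lanes = [None]
        if ":" in key:
            lane_spec, key = key.split(":", 1)
            lanes = [int(char) for char in lane_spec]  # e.g. `123' -> 1, 2, 3
        for lane in lanes:
            settings.setdefault(key, {})[lane] = value
    return settings

def lookup(settings, key, lane):
    """Per-lane value if present, otherwise the flowcell-wide value."""
    per_lane = settings.get(key, {})
    return per_lane.get(lane, per_lane.get(None))

def qcal_source(settings, lane, read):
    """Resolve the quality table source for one read of one lane."""
    value = lookup(settings, "QCAL_SOURCE%d" % read, lane)
    if value is None:
        value = lookup(settings, "QCAL_SOURCE", lane)
    return value if value is not None else "auto"  # default: own lane

config = parse_config([
    "ANALYSIS eland_pair",
    "QCAL_SOURCE /home/illumina/ref42_qtable.txt",
    "123:QCAL_SOURCE auto8",
    "4:QCAL_SOURCE1 /home/illumina/ref51_qtable.txt",
    "4:QCAL_SOURCE2 auto7",
])
for lane in range(1, 9):
    print(lane, qcal_source(config, lane, 1), qcal_source(config, lane, 2))
```

The resolver returns the raw source value only; mapping `auto8' onto the read 1 or read 2 qtable of lane 8 would follow the pairing rules listed earlier.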

Observations

  • Best achievable Q score depends on total number of data points (because of the 1+... term in the version of the phred formula implemented in the paper).
  • The quality of alignments in training data set is also crucial.
  • The parameters used are not always strictly monotonic predictors for data quality. Furthermore, there is a tendency of the phred algorithm to overfit and there are cases when subsets of the data actually have higher quality scores than earlier rules.

TODO

  • Auto-generate equal occupancy binnings.
  • Use better rules for finding maxima than suggested in the phred paper - for example, the binning is currently restricted to rounded values of the phred scores, and the number of bins in different dimensions may not be the same - so the "maximum sum of indices" criterion is not very useful (it's presumably not very useful in the original paper either, unless you can really make sure that all bins have similar occupancy numbers).

Acknowledgments

Many thanks go to Pablo Alvarez and Will Brockman, who pioneered the application of the phred scheme to Solexa data and suggested the parameters we are currently using.

Document generated by Confluence on Jul 25, 2008 16:42