
Copyright (c) Illumina Inc. 2007. All rights reserved.

Author: Anthony J. Cox

What's the difference between Solexa's base scores and Phred scores?

Like Phred scores, Solexa's base scoring scheme is just a way of expressing estimates of sequencing error probability in a convenient form.

Many people are familiar with Phred scores, named after the Phred base-calling software developed by Phil Green and coworkers. The Phred score of a base is

Q_phred = -10 log10(e)

where e is the estimated probability of a base being wrong. If a base is estimated to have a 1% chance of being wrong, it gets a Phred score of 20. Phred score=30 corresponds to 0.1% estimated error, 40 to 0.01%, and so on.

We wanted a 4-values-per-base scheme that also encodes information on the next most likely base call. After some discussion with James Bonfield at Sanger we came up with the following:

Q_solexa = 10 log10(p(X)/(1-p(X)))

where p(X) is the estimated probability that the base is X.

You will get one positive score - that's the score of your base call - and three negative scores.

To convert from a Solexa score back to a probability value use:

p(X) = 1 - 1/(1 + 10^(Q_solexa/10))

Important point (so important it merits its own box): the highest Solexa base score and the Phred score are asymptotically identical. In English this means that for scores of about 15 and above they are so close as to be effectively the same. For the precisely minded, the exact formula is:

Q_solexa = Q_phred + 10 log10(1-e)
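
To make the conversions concrete, here is a minimal Python sketch of the formulas above (the function names are ours, not part of any Solexa software):

import math

def phred_from_error(e):
    # Phred score from the estimated probability e that a base call is wrong.
    return -10 * math.log10(e)

def solexa_from_prob(p):
    # Solexa score from p(X), the estimated probability that the base is X.
    return 10 * math.log10(p / (1 - p))

def prob_from_solexa(q):
    # Convert a Solexa score back to a probability.
    return 1 - 1 / (1 + 10 ** (q / 10))

e = 0.01                                          # a base with a 1% estimated error probability
print(phred_from_error(e))                        # 20.0
print(solexa_from_prob(1 - e))                    # ~19.96 - effectively the same for scores >= ~15
print(prob_from_solexa(solexa_from_prob(1 - e)))  # ~0.99, recovering 1 - e

For a 10% error probability the two scales differ more noticeably: Phred gives 10, while Solexa gives 10 log10(0.9/0.1), about 9.54, which is exactly Q_phred + 10 log10(1-e).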

It's important to separate the notion of a base scoring scheme from the base scores themselves, which are just estimates of error probability as encoded by the scheme. The Bustard base caller produces error estimates in the 4-values-per-base format; these are held in the _prb.txt files in the Bustard directory.

When run in the "ANALYSIS default" mode, the pipeline uses the phageAlign program to align reads against a reference, weighting each base according to the intensity-based quality values produced by the base caller. The output of this goes in the _qalign.txt files. At the time of writing the quality scores are not calibrated: they underestimate base quality at lower quality values and overestimate it at higher values.

However in "ANALYSIS default" mode the pipeline also generates its own "empirical quality values" and realigns using those. This predates intensity-based quality values being produced by the base caller and still provides a useful validation of the intensity-based quality values.

At the moment phageAlign still uses the scoring scheme S = 100 log10(p(X)/(1-p(X))), i.e. 10 times the _prb.txt scores. This will be fixed in the next pipeline release.

This is done as follows:

  1. A first phageAlign alignment is done giving all bases equal weight; the results are stored in the _align.txt files. A matching base is arbitrarily given a score of 500, equating to odds of 100000:1 against the base being wrong. If an erroneous base is assumed to be equally likely to be mis-called as each of the three possible mismatches, this forces a score of -547 for a mismatch (see the sketch after this list).
  2. This alignment is used to compute an empirical error probability for each base at each cycle.
  3. These probabilities are converted to Solexa scores (stored in the _score.txt files).
  4. The sequences are realigned, weighting the bases according to the alignment scores. These go in the _prealign.txt files. They are then purity filtered and the results go in the _realign.txt files. Compare also "What do the different files in an analysis directory mean?".
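
As a rough illustration of steps 1-3 (a sketch in Python, not the actual pipeline code), the following shows how the match score of 500 in phageAlign's 100*log10 scheme forces a mismatch score of about -547, and how an empirical error probability would be turned into a score:

import math

def phagealign_score(p):
    # phageAlign's current scheme: S = 100 * log10(p / (1 - p)),
    # i.e. 10 times the Solexa scores held in the _prb.txt files.
    return 100 * math.log10(p / (1 - p))

# Step 1: a matching base is arbitrarily scored 500, i.e. odds of 10^5 to 1
# against the base being wrong.
odds = 10 ** (500 / 100)        # 100000
p_match = odds / (1 + odds)     # probability that the matching base is right
e = 1 - p_match                 # probability that it is wrong

# Assume the error is equally likely to be each of the three possible mismatches.
p_mismatch = e / 3
print(phagealign_score(p_mismatch))   # about -547.7, the -547 quoted in step 1

# Steps 2-3: an empirical error probability for a given base and cycle is
# converted to a score in the same way (the 1% here is just an illustrative value).
print(phagealign_score(1 - 0.01))     # about 199.6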

How valid is it to estimate base quality by looking at alignments?

That depends on:

a) what you are aligning against, and

b) what you are doing the alignment with

When estimating error rates from alignments, the key thing to bear in mind is that what you want is the probability that a base is wrong. However, what you actually get is the probability that a base is wrong given that the read containing it is uniquely alignable to your reference. This in turn depends on several things.

i. How sensitive your alignment program is. The ELAND program only detects alignments with at most two errors per fragment, so noisier reads containing three or more errors are ignored. This means that error rate estimates obtained from ELAND alignments underestimate the true error rate.

On the other hand, ignoring this issue completely does enable you to make spurious claims about your platform's error rate and still get published in PNAS.

ii. The uniqueness of reads in your target. This in turn depends on your read length, the length of your reference sequence and how repetitive your reference sequence is. Only the first of these is (mostly) under your control. If your read length is too short to compensate for the other two factors, then a read with three errors will often also be "three errors similar" to another position in the reference, as well as to its originating position, and therefore can't be uniquely aligned.

This means, again, that noisier reads tend to be 'lost' and so do not contribute to the error rate calculation. So, again, the estimated error rate will underestimate the true error rate, but it's important to realise that this is the same effect as in i. arising for a different reason: even with a "perfect" alignment program we are still subject to this phenomenon.

iii. Contaminants in the sample may also contribute to the error rate, depending on the read length, the target sample and the edit distance of the contaminant reads to the target genome.

The human BAC sequencing runs we do for validation purposes mitigate both i. and ii. to a large extent, meaning that the calculated error rates closely estimate the true error rate. The phageAlign program allows any number of errors per fragment, and reads of the default read length of 25 can be uniquely aligned to the BAC even in the presence of 6 or 7 sequencing errors in a single fragment.

The above graph (generated by Tony C in Jan 2006) shows the effect of generating simulated BAC reads whose distribution of sequencing errors matched that observed in three actual experiments. An error rate was then computed by analysing the simulated reads with the pipeline. It can be seen that there is an extremely close match between the "actual" and "computed" error rates (and also that the computed error rates slightly underestimate the actual error rates, for the reasons outlined above).

Where are the alignment scores in the phageAlign output?

AGTAGGAGGTGAGGCGGGGAGTAGG 5171 1 150002 F AGCAGCAGCAGAAGCAGGGAGGAGG 4124

Field 1 is the read, field 2 is the score of its best alignment. Field 3 is the number of positions in the reference sequence to which the read aligns with that score. If this is 1 (ie there is a unique best match), then a further four fields are present. Fields 4 and 5 are the position and strand of the unique best match, field 6 is the portion of the reference the read aligned to, and field 7 is the score of the next highest match.
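
For convenience, here is a minimal Python sketch of how such a line might be parsed, assuming whitespace-separated fields laid out as described above (the field names are ours):

def parse_phagealign_line(line):
    fields = line.split()
    record = {
        "read": fields[0],                      # field 1: the read
        "best_score": int(fields[1]),           # field 2: score of its best alignment
        "num_best_matches": int(fields[2]),     # field 3: number of positions with that score
    }
    # The remaining four fields are only present when there is a unique best match.
    if record["num_best_matches"] == 1:
        record["position"] = int(fields[3])     # field 4: position of the unique best match
        record["strand"] = fields[4]            # field 5: strand ('F' in the example above)
        record["reference"] = fields[5]         # field 6: portion of the reference aligned to
        record["next_best_score"] = int(fields[6])  # field 7: score of the next highest match
    return record

line = "AGTAGGAGGTGAGGCGGGGAGTAGG 5171 1 150002 F AGCAGCAGCAGAAGCAGGGAGGAGG 4124"
print(parse_phagealign_line(line))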

Where are the alignment scores in ELAND output?

Unfortunately ELAND does not (yet) have a notion of base quality - "all bases are created equal."

How do Solexa scores relate to BLAST scores?

(Under construction...)

Short answer: not closely. Most of the existing work on alignment statistics, including Karlin-Altschul statistics and the theory behind BLAST, pertains to local alignments: given two sequences, e.g. your 500 base pair read and a public database, find the "maximal segment pair" - the subregions of the two sequences having the highest alignment score according to some scoring scheme. This problem is solved exactly by the Smith-Waterman algorithm and approximately, but more quickly, by BLAST and other alignment programs.

The alignment problem for Solexa and other short reads is a global alignment problem: what is the best alignment of all bases of the read in the target database?

Further reading

http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html
http://blast.wustl.edu/doc/infotheory.html


error_rate.PNG (image/x-png)
Qsolphred.png (image/png)