Data analysis - documentation : Alignment Scoring Guide and FAQ
This page last changed on Mar 22, 2007 by ac.
Copyright (c) Illumina Inc. 2007. All rights reserved. Author: Anthony J. Cox

What's the difference between Solexa's base scores and Phred scores?

Like Phred scores, Solexa's base scoring scheme is just a way of expressing estimates of sequencing error probability in a convenient form. Many people are familiar with Phred scores, named after the Phred base-calling software developed by Phil Green and coworkers. The Phred score of a base is

Q_phred = -10 log10(e)

where e is the estimated probability of the base being wrong. If a base is estimated to have a 1% chance of being wrong, it gets a Phred score of 20. A Phred score of 30 corresponds to 0.1% estimated error, 40 to 0.01%, and so on.

We wanted a 4-values-per-base scheme because we wanted a scheme that also encodes information on the next most likely base call. After some discussion with James Bonfield at Sanger we came up with the following:

Q_solexa = 10 log10( p(X) / (1 - p(X)) )

where p(X) is the estimated probability that the base is X. You will get one positive score - that's the score of your base call - and three negative scores. To convert from a Solexa score back to a probability value use:

p(X) = 1 / (1 + 10^(-Q_solexa/10))
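As a rough illustration of the two scoring schemes, the sketch below implements the conversions in Python. It assumes the standard Phred definition (with the factor of 10 that makes a 1% error correspond to a score of 20) and the Solexa log-odds definition given above. The function names are made up for this example and are not part of the pipeline. The last function checks the identity Q_solexa = Q_phred + 10 log10(1 - e), which follows directly from the two definitions when p(X) = 1 - e for the called base.

```python
import math


def phred_from_error(e):
    """Phred score for an estimated error probability e (0 < e < 1)."""
    return -10.0 * math.log10(e)


def solexa_from_prob(p):
    """Solexa score for an estimated probability p that the base is X."""
    return 10.0 * math.log10(p / (1.0 - p))


def prob_from_solexa(q):
    """Invert the Solexa score: recover p(X) from Q_solexa."""
    return 1.0 / (1.0 + 10.0 ** (-q / 10.0))


def solexa_from_phred(q_phred):
    """Solexa score of the called base, given its Phred score."""
    e = 10.0 ** (-q_phred / 10.0)  # error probability implied by the Phred score
    return q_phred + 10.0 * math.log10(1.0 - e)


if __name__ == "__main__":
    for e in (0.5, 0.1, 0.01, 0.001):
        print(e, phred_from_error(e), solexa_from_prob(1.0 - e))
```

Running the loop shows that the two scores diverge for poor bases (at e = 0.5 the Solexa score is 0 while the Phred score is about 3) and converge as e becomes small.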
The two scores are related by

Q_solexa = Q_phred + 10 log10(1 - e)

so for small e the two scores are almost identical.

It's important to separate the notion of a base scoring scheme from the base scores themselves, which are just estimates of error probability as encoded by the scheme. The Bustard base caller produces error rates in the 4-values-per-base format; these are held in the _prb.txt files in the Bustard directory.

When run in the "ANALYSIS default" mode, the pipeline uses the phageAlign program to align reads against a reference, weighting each base according to the intensity-based quality values produced by the base caller. The output of this goes in the _qalign.txt files. At the time of writing the quality scores are not calibrated: they underestimate base quality for lower quality values and overestimate it for higher values.

However, in "ANALYSIS default" mode the pipeline also generates its own "empirical quality values" and realigns using those. This approach predates the base caller producing intensity-based quality values, and it still provides a useful validation of them.
This is done as follows:
How valid is it to estimate base quality by looking at alignments?

Depends on: and b) what you are doing the alignment with

When estimating error rates from alignments, the key thing to bear in mind is that what you want is the probability that a base is wrong. However, what you actually get is the probability that a base is wrong given that the read it is in is uniquely alignable to your reference. This in turn depends on several things.

i. How sensitive your alignment program is. The ELAND program only detects alignments with at most two errors per fragment, so the noisier reads having three or more errors will be ignored, meaning that error rate estimates obtained from ELAND alignments underestimate the true error rate. On the other hand, ignoring this issue completely does enable you to make spurious claims about your platform's error rate and still get published in PNAS.

ii. The uniqueness of reads in your target. This in turn depends on your read length, the length of your reference sequence and how repetitive your reference sequence is. Obviously only the first of these is (mostly) under your control. If your read length is too short to compensate for the other two factors then it may be the case that a read with three errors will often also be "three errors similar" to another position in the reference, as well as to its originating position. It therefore can't be uniquely aligned. This means, again, that noisier reads tend to be 'lost' and so do not contribute to the error rate calculation. So, again, the estimated error rate will underestimate the true error rate, but it's important to realise that we get the same effect as i. for a different reason - even if we have a "perfect" alignment program we are still subject to this phenomenon.

iii. Contaminants in the sample may contribute to error rates - depending on the read length, target sample and the edit distance of the contaminant reads to the target genome.
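The size of the effect described in point i. can be illustrated with a toy calculation. The sketch below assumes, purely for illustration, that every base in a read is wrong independently with the same probability e, so the number of errors per read is binomial. Real reads do not behave this simply. It then computes the error rate you would observe if only reads with at most max_errors errors were kept. The function name and defaults (read length 25, at most two errors) are chosen for this example.

```python
from math import comb


def observed_error_rate(e, read_len=25, max_errors=2):
    """Per-base error rate seen when reads with more than max_errors
    errors are discarded, under an independent-errors (binomial) model."""
    # Probability that a read contains exactly k errors, for k = 0..max_errors.
    probs = [
        comb(read_len, k) * e**k * (1.0 - e) ** (read_len - k)
        for k in range(max_errors + 1)
    ]
    kept = sum(probs)  # fraction of reads that survive the filter
    mean_errors = sum(k * p for k, p in enumerate(probs)) / kept
    return mean_errors / read_len


if __name__ == "__main__":
    for true_rate in (0.01, 0.02, 0.05):
        print(true_rate, observed_error_rate(true_rate))
```

Under this model the observed rate is always below the true rate when max_errors is smaller than the read length, and the gap widens as the true rate grows. Setting max_errors equal to read_len recovers the true rate.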
The human BAC sequencing runs we do for validation purposes mitigate both i. and ii. to a large extent, meaning that the calculated error rates closely estimate the true error rate. The phageAlign program allows any number of errors per fragment, and reads of the default read length of 25 can be uniquely aligned to the BAC even in the presence of 6 or 7 sequencing errors in a single fragment.

The above graph (generated by Tony C in Jan 2006) shows the effect of generating simulated BAC reads whose distribution of sequencing errors matched that observed in three actual experiments. An error rate was then computed by analysing the simulated reads with the pipeline. It can be seen that there is an extremely close match between the "actual" and "computed" error rates (and also that the computed error rates slightly underestimate the actual error rates, for the reasons outlined above).

Where are the alignment scores in the phageAlign output?

AGTAGGAGGTGAGGCGGGGAGTAGG 5171 1 150002 F AGCAGCAGCAGAAGCAGGGAGGAGG 4124

Field 1 is the read and field 2 is the score of its best alignment. Field 3 is the number of positions in the reference sequence to which the read aligns with that score. If this is 1 (i.e. there is a unique best match), then a further four fields are present. Fields 4 and 5 are the position and strand of the unique best match, field 6 is the portion of the reference the read aligned to, and field 7 is the score of the next highest match.

Where are the alignment scores in ELAND output?

Unfortunately ELAND does not (yet) have a notion of base quality - "all bases are created equal."

How do Solexa scores relate to BLAST scores?

(Under construction...) Short answer: not closely. Most of the existing work on alignment statistics, including Karlin-Altschul statistics and the theory behind BLAST, pertains to local alignments: given two sequences, e.g. your 500 base pair read and a public database, find the "maximal scoring pair" - subregions of the two sequences having the highest alignment score according to some scoring scheme. This problem is exactly solved by the Smith-Waterman algorithm and approximately solved more quickly by BLAST and other alignment programs. The alignment problem for Solexa and other short reads is a global alignment problem: what is the best alignment of all bases of the read in the target database?

Further reading

http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html
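For readers who want to pull the scores out of phageAlign output programmatically, the field layout described in "Where are the alignment scores in the phageAlign output?" can be read with a few lines of Python. This is a minimal sketch. It assumes the fields are separated by whitespace, which is how the example line appears on this page but is not confirmed here. The function name and dictionary keys are invented for this example.

```python
def parse_phagealign_line(line):
    """Split one phageAlign output line into named fields.

    Assumes whitespace-separated fields. Fields 4-7 are only read
    when field 3 says there is a unique best match.
    """
    fields = line.split()
    record = {
        "read": fields[0],              # field 1: the read itself
        "best_score": int(fields[1]),   # field 2: score of the best alignment
        "num_best": int(fields[2]),     # field 3: positions with that score
    }
    if record["num_best"] == 1 and len(fields) >= 7:
        record["position"] = int(fields[3])         # field 4
        record["strand"] = fields[4]                # field 5
        record["reference"] = fields[5]             # field 6: aligned reference
        record["next_best_score"] = int(fields[6])  # field 7
    return record


if __name__ == "__main__":
    example = ("AGTAGGAGGTGAGGCGGGGAGTAGG 5171 1 150002 F "
               "AGCAGCAGCAGAAGCAGGGAGGAGG 4124")
    print(parse_phagealign_line(example))
```

The difference between best_score and next_best_score gives a simple indication of how clearly the best match stands out from the runner-up.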
Document generated by Confluence on Dec 19, 2007 18:32