This page last changed on Jun 06, 2008 by rshaw.

The Summary.htm page provides an overview of quality metrics for a run and links to more detailed information in the form of pages of graphs. It is intended to be quick to load into a browser; depending upon the number of lanes and tiles used, the pages to which it links may take much longer to display.

As well as being an HTML page, Summary.htm is also a valid XML file (or at least meets the expectations of the Perl XML::Simple module) to facilitate automated information extraction. The format of Summary.htm may, however, change to some extent between Pipeline releases, e.g. to provide additional statistics relevant to new analysis modes.
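Because the page parses as XML, any XML library can be used to walk it. The following Python sketch uses the standard-library xml.etree.ElementTree module on a tiny hand-written stand-in for Summary.htm. The element layout (a heading followed by a table) and the cell values are assumptions made for illustration only; they are not the actual Pipeline output, whose structure may differ between releases.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified stand-in for Summary.htm. The real page
# layout is not described here and may change between Pipeline releases.
SAMPLE = """<html><body>
<h2>Chip Summary</h2>
<table>
<tr><td>Instrument ID</td><td>EXAMPLE-ID</td></tr>
<tr><td>Run Folder</td><td>EXAMPLE_RUN_FOLDER</td></tr>
</table>
</body></html>"""


def tables_as_rows(xml_text):
    """Return every <table> in the document as a list of rows of cell text."""
    root = ET.fromstring(xml_text)
    tables = []
    for table in root.iter("table"):
        rows = []
        for tr in table.iter("tr"):
            # Cells may be <td> or <th>; join any nested text and trim it.
            cells = ["".join(cell.itertext()).strip()
                     for cell in tr if cell.tag in ("td", "th")]
            rows.append(cells)
        tables.append(rows)
    return tables


chip_summary = tables_as_rows(SAMPLE)[0]
print(chip_summary)
```

Keeping the extraction generic (tables as lists of rows) rather than tied to particular column positions should make such a script a little more tolerant of format changes.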

Chip Summary

This reports the instrument ID and the run folder. The Chip ID is a placeholder field; currently `unknown'. (The terms `chip' and `flowcell' are used interchangeably.)

Chip Results Summary

This table displays chip-wide summary performance statistics for the run. Both the original number of detected clusters and the number that passed quality filtering are shown. In addition, a chip yield in kilobases is presented. This is the sum over analysed lanes of the product of the number of quality-filtered clusters and the number of bases per cluster used for analysis, i.e. excluding bases masked out by a USE_BASES directive.
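The yield calculation described above can be sketched in a few lines of Python. The function name and the lane figures are hypothetical; only the arithmetic (sum over lanes of filtered clusters times bases used, expressed in kilobases) comes from the description.

```python
def chip_yield_kb(lanes):
    """Sum (PF clusters x bases used per cluster) over lanes, in kilobases."""
    total_bases = sum(pf_clusters * bases_used
                      for pf_clusters, bases_used in lanes)
    return total_bases / 1000.0


# Hypothetical run: two analysed lanes as (PF clusters, bases used per cluster).
# The second lane has one base masked out by USE_BASES.
lanes = [(1_000_000, 36), (800_000, 35)]
print(chip_yield_kb(lanes))  # 64000.0
```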

Lane Parameter Summary

This records information about the sample in each flowcell lane and the analysis that has been specified for it :-

  • Sample ID: placeholder field; currently `unknown' (undocumented approaches to supplying this value should be considered unsupported)
  • Sample Target: the reference sequence(s) against which reads from the sample in this lane are to be aligned. Depending on analysis mode this may be the name of a folder containing one or more sequence (and auxiliary) files or the name of an individual file; the required file formats also depend on analysis mode.
  • Sample Type: the analysis mode (the options are constrained by the nature of the sample) for reads from this lane.
  • Length: the number of bases used per read (excluding any bases masked out using USE_BASES); where multiple reads are produced per cluster and a distinction is maintained between them during analysis (e.g. eland_pair analysis of paired end reads), their respective lengths will be listed.
  • Filter: the criterion for clusters to be selected for analysis beyond the preliminary stages (statistics for all detected clusters and for the subset that pass filtering are annotated as `raw' and `PF', respectively in Summary.htm).
  • Num Tiles: the number of tiles from the lane that will be used in the analysis
  • Tiles: a hyperlink for each lane to the location (still within Summary.htm) of the statistics for individual tiles in that lane.

Lane Results Summary

This table displays basic data quality metrics for each lane. Apart from Lane Yield, which is the total value for the lane, all the statistics are given as means and standard deviations over the tiles (used) in the lane :-

  • Clusters (raw): the number of clusters detected by the image analysis stage (Firecrest) of the Pipeline
  • Clusters (PF): the number of detected clusters that meet the filtering criterion (see Lane Parameter Summary)
  • 1st Cycle Int (PF): the average of the four intensities (one per channel, i.e. base type) measured at the first cycle (after any masking of cycles), averaged over filtered clusters.
  • % intensity after 20 cycles (PF): the corresponding intensity statistic at (masked) cycle 20 as a percentage of that at the first cycle.
  • % PF Clusters: the percentage of clusters passing filtering
  • % Align (PF): the percentage of filtered reads that were uniquely aligned to the reference
  • Alignment Score (PF): the average filtered read alignment score (reads with multiple alignments or none effectively contribute scores of 0)
  • % Error Rate (PF): the percentage of called bases in aligned reads that do not match the reference

If eland_pair analysis has been specified for one or more lanes, then there will be two such summaries - one for each read. All lanes for which analysis has been specified will be represented in `Lane Results Summary : Read 1' but only those for which eland_pair analysis has been specified will contribute statistics to the Read 2 table.

Expanded Lane Summary

This displays more detailed quality metrics for each lane. Apart from the phasing and prephasing information, all values are tile means for the lane.

  • Clusters (tile mean) (raw): the number of clusters detected by the image analysis stage (Firecrest) of the Pipeline
  • % Phasing: the estimated (or specified) value used by the Pipeline for the percentage of molecules in a cluster for which sequencing falls behind the current position (cycle) within a read
  • % Prephasing: the estimated (specification is not recommended) value used by the Pipeline for the percentage of molecules in a cluster for which sequencing jumps ahead of the current position (cycle) within a read
  • % Error Rate (raw): the percentage of called bases in aligned reads (from all detected clusters) that do not match the reference
  • Equiv Perfect Clusters (raw): the number of clusters in the ideal situation of read base perfectly predicting reference base that would provide the same information content (entropy of reference base given read base and a priori assumption of equiprobable reference bases) as calculated for all actual detected clusters
  • % retained: the percentage of clusters that passed filtering
  • Cycle 2-4 Av Int (PF): the intensity averaged over cycles 2, 3 and 4 for clusters that passed filtering
  • Cycle 2-10 Av % Loss (PF): the average percentage intensity drop per cycle over cycles 2 to 10 (derived from a best fit straight line for log intensity v. cycle number)
  • Cycle 10-20 Av % Loss (PF): the average percentage intensity drop per cycle over cycles 10 to 20 (derived from a best fit straight line for log intensity v. cycle number)
  • % Align (PF): the percentage of filtered reads that were uniquely aligned to the reference
  • % Error Rate (PF): the percentage of called bases in aligned filtered reads that do not match the reference
  • Equiv Perfect Clusters (PF): the number of clusters in the ideal situation of read base perfectly predicting reference base that would provide the same information content (entropy of reference base given read base and a priori assumption of equiprobable reference bases) as calculated for the actual clusters that passed filtering
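The two `Av % Loss' statistics above are described as being derived from a best fit straight line for log intensity v. cycle number. A minimal Python sketch of one way to do this follows. The use of natural logarithms, the conversion of the slope to a percentage and the intensity values are all assumptions; the Pipeline's exact calculation is not documented here.

```python
import math


def percent_loss_per_cycle(cycles, intensities):
    """Least-squares fit of log(intensity) against cycle; return % drop per cycle."""
    logs = [math.log(i) for i in intensities]
    n = len(cycles)
    mean_x = sum(cycles) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in cycles)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(cycles, logs))
    slope = sxy / sxx
    # A slope b in log space corresponds to a factor exp(b) per cycle.
    return (1.0 - math.exp(slope)) * 100.0


# Hypothetical intensities that decay by exactly 2% per cycle over cycles 2-10.
cycles = list(range(2, 11))
intensities = [1000.0 * 0.98 ** (c - 2) for c in cycles]
print(round(percent_loss_per_cycle(cycles, intensities), 3))  # 2.0
```

Fitting in log space means a constant fractional loss per cycle appears as a straight line, which is why a single slope can summarise the decay.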

In the same manner as for the Lane Results Summary, specification of eland_pair analysis for any of the lanes will result in two Expanded Lane Summary tables - displaying statistics for Read 1 and (for lanes where it exists) Read 2.

Per-Tile Statistics

Below the two types of lane summary are per-tile statistics, grouped into a table for each lane. The statistics are a subset of those already presented in the Lane Results Summary but in these tables are averages over the detected (raw) or filtered (PF) clusters in individual tiles.

In the event that no clusters in a tile pass filtering, all the statistics for that tile will be displayed within square brackets. Such an occurrence suggests an exceptional situation (e.g. a bubble) within the tile; the brackets indicate the tile has thus been excluded from the calculation of lane statistics and that the values are reported only for diagnostic purposes.
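The exclusion rule can be illustrated with a short Python sketch that computes lane-level means and standard deviations over tiles while skipping any tile with no filtered clusters. The tile counts are invented, and the choice of the sample standard deviation is an assumption, since the text does not say which form the Pipeline uses.

```python
from statistics import mean, stdev


def lane_stats(tiles):
    """tiles: list of (raw_clusters, pf_clusters) per tile.

    Tiles with no PF clusters are left out, mirroring the bracketed tiles.
    At least two usable tiles are needed for a standard deviation.
    """
    used = [(raw, pf) for raw, pf in tiles if pf > 0]
    pf_counts = [pf for _, pf in used]
    pct_pf = [100.0 * pf / raw for raw, pf in used]
    return {
        "tiles_used": len(used),
        "clusters_pf_mean": mean(pf_counts),
        "clusters_pf_sd": stdev(pf_counts),  # sample SD: an assumption
        "pct_pf_mean": mean(pct_pf),
    }


# Hypothetical lane of three tiles; the third has no PF clusters (e.g. a bubble).
tiles = [(20000, 15000), (22000, 17000), (18000, 0)]
stats = lane_stats(tiles)
print(stats["tiles_used"], stats["clusters_pf_mean"])  # 2 16000
```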

Monotemplate Summary

This table appears after the per-tile summary table for lanes for which monotemplate analysis is specified (corresponding to monotemplate samples). Statistics are presented for each monotemplate specified :-

  • Lane: lane number
  • Template: the monotemplate sequence
  • Count: the number of reads that aligned to the monotemplate
  • Percent: the percentage of reads aligned to monotemplates that aligned to the current monotemplate
  • True 1st Cycle Intensity: the average intensity of the first base in reads aligned to the monotemplate
  • Av Error Rate: the average error rate over all cycles as a percentage of called bases for reads aligned to the monotemplate
  • % Perfect: the percentage of reads that are a perfect match to the monotemplate out of those that align to it

Pair Summary

For lanes for which eland_pair analysis was performed, there will be two per-tile summary tables (one for each read) and these will be preceded by a set of tables collectively entitled the `Pair Summary'. These provide statistics about the alignment outcomes of the two reads individually and as a pair, the latter including relative orientation and separation (insert size) of partner read alignments.

Note that if the criteria for paired alignment to be attempted are not met, the subset of tables reporting paired alignment results will be replaced with the statement `Paired alignment not performed'.

Individual Alignments

This table displays the frequencies of the various possible combinations of individual alignment outcomes within a pair, i.e. for a particular Read 1 alignment outcome what its Read 2 alignment outcome was. The possible outcomes are :-

  • Unique : a unique alignment
  • Rescuable : multiple alignments but such that a unique paired alignment could potentially be derived if the partner read is either unique or similarly rescuable
  • Repeat : multiple alignments such that consideration of the partner read will not help in selecting between them
  • Not Matched : the read was not aligned
  • Low Quality : the read contained too many uncalled bases to attempt alignment

Here is a real example, taken from an E. coli dataset.

Read 1 \ Read 2 | Unique          | Rescuable    | Repeat      | Not Matched  | Low Quality  | Total
Unique          | 2443861 (93.2%) | 17547 (0.7%) | 395 (0.0%)  | 31818 (1.2%) | 4292 (0.2%)  | 2497913 (95.3%)
Rescuable       | 17670 (0.7%)    | 54356 (2.1%) | 929 (0.0%)  | 897 (0.0%)   | 118 (0.0%)   | 73970 (2.8%)
Repeat          | 405 (0.0%)      | 920 (0.0%)   | 289 (0.0%)  | 9 (0.0%)     | 2 (0.0%)     | 1625 (0.1%)
Not Matched     | 26724 (1.0%)    | 776 (0.0%)   | 10 (0.0%)   | 6928 (0.3%)  | 240 (0.0%)   | 34678 (1.3%)
Low Quality     | 138 (0.0%)      | 2 (0.0%)     | 0 (0.0%)    | 19 (0.0%)    | 14037 (0.5%) | 14196 (0.5%)
Total           | 2488798 (94.9%) | 73601 (2.8%) | 1623 (0.1%) | 39671 (1.5%) | 18689 (0.7%) | 2622382 (100.0%)

This table says, for example, that 2497913 clusters in total were such that read 1 had a unique best alignment in the reference and, of those, 2443861 were also such that read 2 had a unique alignment. In total, 2622382 clusters passed purity filtering.
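As a cross-check on how the table is read, the row totals, column totals and percentages can be recomputed from the cell counts. The Python sketch below uses the counts from the example; the variable names are illustrative only.

```python
OUTCOMES = ["Unique", "Rescuable", "Repeat", "Not Matched", "Low Quality"]

# Rows: Read 1 outcome; columns: Read 2 outcome (counts from the example).
COUNTS = [
    [2443861, 17547, 395, 31818, 4292],
    [17670, 54356, 929, 897, 118],
    [405, 920, 289, 9, 2],
    [26724, 776, 10, 6928, 240],
    [138, 2, 0, 19, 14037],
]

grand_total = sum(sum(row) for row in COUNTS)
row_totals = {name: sum(row) for name, row in zip(OUTCOMES, COUNTS)}
col_totals = {name: sum(col) for name, col in zip(OUTCOMES, zip(*COUNTS))}


def pct(count):
    """Percentage of all clusters in the table, to one decimal place."""
    return round(100.0 * count / grand_total, 1)


print(grand_total)                                      # 2622382
print(row_totals["Unique"], pct(row_totals["Unique"]))  # 2497913 95.3
print(col_totals["Unique"], pct(col_totals["Unique"]))  # 2488798 94.9
```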

More detail on the difference between 'rescuable' and 'repeat'

ELAND maintains a list of best alignments for each read. By default, this list can contain at most 10 alignments. The maximum size of the list may be increased (or decreased) by specifying ELAND_MAX_MATCHES in the GERALD config file. As usual, this can be done on a lane-by-lane basis if desired.

Reads for which ELAND has retained a list of alignments may potentially be 'rescued' if only one of the possible alignments has the correct insert size and orientation with respect to the other read of the cluster. These are called 'rescuable.' Not all rescuable reads will be rescued; this is dependent on the read pairing algorithm finding a single consistent pairing of the reads among the list of possibilities. The read pairing algorithm's ability to do this in turn depends on factors such as the quality of the sample prep and the degree of structural variation between the reference sequence being aligned to and the sample being sequenced. Conversely, if the number of possible alignments exceeds ELAND_MAX_MATCHES then ELAND stores only the number of matches found. There is therefore no potential for picking the best alignment so the read is classed as a 'repeat.'

Thus if the analysis were to be re-run with a higher value of ELAND_MAX_MATCHES, some reads that were classed as repeats would become rescuable, but the sum of rescuable and repeat reads will stay the same.
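The effect of ELAND_MAX_MATCHES on this classification can be sketched as follows. This is a simplified model written for illustration, not ELAND's actual logic: it classifies a read purely by its number of best alignments, ignores the Low Quality outcome, and uses invented alignment counts.

```python
from collections import Counter


def classify(num_alignments, max_matches=10):
    """Simplified outcome for one read, given its number of best alignments."""
    if num_alignments == 0:
        return "Not Matched"
    if num_alignments == 1:
        return "Unique"
    if num_alignments <= max_matches:
        return "Rescuable"  # the list of alignments is retained
    return "Repeat"         # only the number of matches is stored


def tally(hit_counts, max_matches):
    return Counter(classify(n, max_matches) for n in hit_counts)


# Hypothetical numbers of best alignments for six reads.
hit_counts = [1, 3, 12, 0, 25, 8]
default = tally(hit_counts, 10)
raised = tally(hit_counts, 20)

print(default["Rescuable"], default["Repeat"])  # 2 2
print(raised["Rescuable"], raised["Repeat"])    # 3 1
```

Raising the limit moves one read from Repeat to Rescuable, while the combined count of the two classes stays at four.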

Unique Paired Alignments

If we have at least one alignment for each of the two reads in a cluster, then there are three possible outcomes:

  1. It is only possible to pick one alignment for read 1 and one alignment for read 2 such that the relative orientation of the two alignments and the distance between the alignment positions are both consistent with our knowledge of the sample - we call this a 'unique paired alignment' of the cluster.
  2. There is more than one way to pick one alignment for read 1 and one alignment for read 2 such that the relative orientation of the two alignments and the distance between the alignment positions are both consistent with our knowledge of the sample - we call this a 'non-unique paired alignment' of the cluster.
  3. There is no way to pick one alignment for read 1 and one alignment for read 2 such that the relative orientation of the two alignments and the distance between the alignment positions are both consistent with our knowledge of the sample - we call this an 'inconsistent pair.'
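The three outcomes amount to counting how many combinations of a read 1 alignment and a read 2 alignment are consistent. The Python sketch below shows the idea under stated assumptions: the consistency test (reads on opposite strands pointing towards each other, with the insert size within fixed limits), the read length of 36 and the alignment positions are all hypothetical, and the Pipeline's own pairing algorithm may well differ in detail.

```python
from itertools import product


def is_consistent(aln1, aln2, low, high, read_length):
    """Hypothetical check: opposite strands, facing inwards, insert in range.

    Each alignment is (position of leftmost base, strand), strand 'F' or 'R'.
    """
    (pos1, strand1), (pos2, strand2) = aln1, aln2
    if strand1 == strand2:
        return False
    # Identify the forward-strand read; the reverse read must lie downstream.
    fwd, rev = (pos1, pos2) if strand1 == "F" else (pos2, pos1)
    if rev < fwd:
        return False
    insert = rev + read_length - fwd
    return low <= insert <= high


def pairing_outcome(read1_alns, read2_alns, low, high, read_length):
    n = sum(is_consistent(a1, a2, low, high, read_length)
            for a1, a2 in product(read1_alns, read2_alns))
    if n == 1:
        return "unique paired alignment"
    if n > 1:
        return "non-unique paired alignment"
    return "inconsistent pair"


read1 = [(1000, "F")]
print(pairing_outcome(read1, [(1180, "R"), (5000, "R")], 184, 247, 36))
print(pairing_outcome(read1, [(1180, "R"), (1190, "R")], 184, 247, 36))
print(pairing_outcome(read1, [(5000, "R")], 184, 247, 36))
```

The three calls give a unique paired alignment, a non-unique paired alignment and an inconsistent pair, respectively.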

This table breaks down the unique paired alignments according to the alignment outcomes of their component reads, which can thus only be unique or rescuable.

Note that the frequencies in this table will often be somewhat lower than the corresponding values in the Individual Alignments table, due to the conditions required for a unique paired alignment. For example, even when both reads are individually uniquely aligned it is possible that their relative positions or orientations will not be compatible with the paired read model. If the alignment positions are unexpectedly far apart, unexpectedly close together or even on different chromosomes, this could be due to some form of error or could represent a genuine feature in the sample (e.g. deletion, insertion or translocation, respectively). Such reads will not, however, be represented in this table.

Continuing our example from the E. coli dataset:

Read 1 \ Read 2 | Unique          | Rescuable    | Total
Unique          | 2437627 (98.5%) | 17125 (0.7%) | 2454752 (99.2%)
Rescuable       | 17266 (0.7%)    | 2260 (0.1%)  | 19526 (0.8%)
Total           | 2454893 (99.2%) | 19385 (0.8%) | 2474278 (100.0%)

Thus in total 2474278 clusters were such that there is just one possible alignment of each read for which the two reads have the appropriate relative position and orientation with respect to one another. Of these, 2437627 clusters were such that there was only one possible alignment of each of the two reads; in the other cases, the read pairing algorithm has had to pick from a list of multiple possibilities for one or both of the two reads.

Common question

In the 'Individual Alignments' table above, the table said 2443861 clusters were such that read 1 and read 2 each had a unique alignment, but here we are saying 2437627 clusters have a unique paired alignment. Why are these figures different? The answer is that the remaining 2443861 - 2437627 = 6234 clusters are such that, even though there is only a single alignment for each read, the relative position and/or orientation of the alignments of the two reads does not match that of the rest of the sample. This could be due to:

  • structural variation between the reference being aligned to and the sample being sequenced
  • misalignment of one or both reads
  • the distance between two reads being at the extreme end of the insert size distribution

Unique Paired Alignment Effects

This table is similar to the Unique Paired Alignments one but focuses upon what proportions of unique paired alignments were rescued, i.e. one or both of the partner reads were not individually aligned uniquely.

Continuing the example:

Effect \ Read                      | Read 1        | Read 2        | All Reads   | Total
Uniqueness Rescue (% of rescuable) | 17266 (97.7%) | 17125 (97.6%) | 2260 (4.2%) | 36651 (40.9%)

The Individual Alignments table says that in 17547 clusters read 1 was unique and read 2 was (potentially) rescuable. This table shows that unique paired alignments were in fact found for 17125, or 97.6%, of these clusters.
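The percentages in this table can be reproduced by dividing the counts from the Unique Paired Alignments table by the corresponding counts from the Individual Alignments table. The Python sketch below does so for the example; taking the third column to mean pairs in which both reads were rescuable is our reading of the figures rather than something the table states.

```python
def pct(numerator, denominator):
    return round(100.0 * numerator / denominator, 1)


# Pairs with a rescuable read, from the Individual Alignments table.
rescuable = {"read 1": 17670, "read 2": 17547, "both reads": 54356}
# Of those, pairs given a unique paired alignment (Unique Paired Alignments).
rescued = {"read 1": 17266, "read 2": 17125, "both reads": 2260}

for key in rescuable:
    print(key, rescued[key], pct(rescued[key], rescuable[key]))
# read 1 17266 97.7
# read 2 17125 97.6
# both reads 2260 4.2

total_rescued = sum(rescued.values())
total_rescuable = sum(rescuable.values())
print(total_rescued, pct(total_rescued, total_rescuable))  # 36651 40.9
```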

Non-unique Paired Alignments

This table breaks down the non-unique paired alignments according to the alignment outcomes of their component reads. (Paired alignment is attempted only for those pairs where the individual alignments of both partners are either unique or rescuable.)

Ploughing onward with our example:

Read 1 \ Read 2 | Unique    | Rescuable     | Total
Unique          | 0 (0.0%)  | 58 (0.1%)     | 58 (0.1%)
Rescuable       | 63 (0.1%) | 51957 (99.8%) | 52020 (99.9%)
Total           | 63 (0.1%) | 52015 (99.9%) | 52078 (100.0%)

Note that a small number of clusters are non-unique even though there is a unique alignment of one of the two reads. This means there is more than one alignment of the other read that is consistent with the alignment of the first, which might occur if it aligns to a tandem repeat or low-complexity region. Because we can be reasonably certain of the position of one of the two reads in the cluster, one could argue that such clusters ought to be classified as 'semi-unique' or similar. Such arguments are not, however, likely to result in a change to the software.

Mispairing Rate

Mispairing is considered to happen when one read of a pair can be aligned (whether the alignment is unique, rescuable or against a repeat) but the other cannot (whether because it is of low quality or simply because no match can be found for it in the reference).

Continuing with the example:

Read 1 Lost  | Read 2 Lost
27650 (1.1%) | 37136 (1.4%)

The Individual Alignments table shows that, of the clusters for which read 1 could not be aligned, 26724 had read 2s that were uniquely aligned, 776 had read 2s that were rescuable and 10 had read 2s that were repeats. Of the clusters with low quality read 1s, 138 had uniquely aligning read 2s, 2 had rescuable read 2s and none had repetitive read 2s. Summing these figures gives the 'Read 1 Lost' figure of 27650 in this table.
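The same summation, together with its counterpart for read 2, can be written as a short Python sketch over the counts of the example Individual Alignments table. The variable names are illustrative only.

```python
# Individual Alignments counts: rows are Read 1 outcomes, columns Read 2
# outcomes, in the order Unique, Rescuable, Repeat, Not Matched, Low Quality.
COUNTS = [
    [2443861, 17547, 395, 31818, 4292],
    [17670, 54356, 929, 897, 118],
    [405, 920, 289, 9, 2],
    [26724, 776, 10, 6928, 240],
    [138, 2, 0, 19, 14037],
]
ALIGNED = (0, 1, 2)    # Unique, Rescuable, Repeat
UNALIGNED = (3, 4)     # Not Matched, Low Quality

total = sum(sum(row) for row in COUNTS)
read1_lost = sum(COUNTS[i][j] for i in UNALIGNED for j in ALIGNED)
read2_lost = sum(COUNTS[i][j] for i in ALIGNED for j in UNALIGNED)

print(read1_lost, round(100.0 * read1_lost / total, 1))  # 27650 1.1
print(read2_lost, round(100.0 * read2_lost / total, 1))  # 37136 1.4
```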

Relative Orientation Statistics

The relative orientation of a pair is the orientation of Read 2 relative to that of Read 1, i.e. defining the Read 1 orientation to be forward. It is defined as positive if the Read 2 position is greater than the Read 1 position. These statistics are given only for those pairs in which both reads were individually uniquely aligned, since these are the ones used to determine the predominant relative orientation (other orientations are considered anomalous and filtered out).

The ASCII art in the column headings is intended as a visual reminder of the definitions of the four possible relative orientations.

For our example the nominal orientation is correctly computed as the two reads 'pointing to' each other, as would be expected for the standard Illumina short insert paired read sample prep.

F-: > R2 R1 > | F+: > R1 R2 > | R-: < R2 R1 > | R+: > R1 R2 <    | Total
184 (0.0%)    | 161 (0.0%)    | 243 (0.0%)    | 2443273 (100.0%) | 2443861

Insert Size Statistics

Statistics are derived from the insert sizes of those pairs in which both reads were individually uniquely aligned and which also have the predominant relative orientation. First the median is determined. Then a standard deviation value is determined independently for those values below the median and those above it. The lower and upper thresholds for acceptable insert sizes are then defined as three of the relevant standard deviations below and above the median, respectively.

Median | Below-median SD | Above-median SD | Low thresh. | High thresh.
214    | 10              | 11              | 184         | 247
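The procedure can be sketched in Python as follows. How each one-sided standard deviation is computed is not specified above; the sketch assumes the root-mean-square deviation from the median, and the three insert sizes are toy values chosen so that the result matches the example table.

```python
import math
from statistics import median


def insert_size_thresholds(insert_sizes, n_sd=3):
    """Return (median, SD below, SD above, low threshold, high threshold).

    Needs at least one value on each side of the median.
    """
    med = median(insert_sizes)
    below = [x for x in insert_sizes if x < med]
    above = [x for x in insert_sizes if x > med]
    # Assumption: each one-sided SD is the RMS deviation from the median.
    sd_below = math.sqrt(sum((x - med) ** 2 for x in below) / len(below))
    sd_above = math.sqrt(sum((x - med) ** 2 for x in above) / len(above))
    return med, sd_below, sd_above, med - n_sd * sd_below, med + n_sd * sd_above


# Toy data: median 214, one value 10 below it and one value 11 above it.
print(insert_size_thresholds([204, 214, 225]))
# (214, 10.0, 11.0, 184.0, 247.0)
```

Using separate standard deviations on each side of the median allows the acceptance window to follow an asymmetric insert size distribution.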

Insert Statistics (% of individually uniquely alignable pairs)

This table shows the numbers of inserts (out of those used to calculate insert size statistics) considered acceptable in size and of those falling outside the thresholds displayed in the above table. The percentages are relative to the original number of pairs in which both reads were individually uniquely aligned.

Too small   | Too large   | Orientation and size OK
3945 (0.2%) | 1701 (0.1%) | 2437627 (99.7%)

The example 'Individual Alignments' table showed that 2443861 clusters were such that read 1 and read 2 each had a unique alignment and the example Unique Paired Alignment table showed that 2437627 clusters have a unique paired alignment. As we said above, this leaves 2443861-2437627=6234 clusters to account for. Taken together, the Relative Orientation Statistics and Insert Statistics tables describe the fate of these clusters.

  • 3945 clusters had a correct orientation but had an implied insert size that was too small.
  • 1701 clusters were correctly oriented but had an implied insert size that was too large.
  • 184, 161 and 243 clusters were oriented in the three possible ways an orientation can be wrong.

These five figures sum to 6234, as we would expect.
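The reconciliation can be checked mechanically. The Python sketch below uses only figures from the example tables; the variable names are illustrative.

```python
both_unique = 2443861          # Individual Alignments: Unique / Unique
unique_paired = 2437627        # Unique Paired Alignments: Unique / Unique
too_small, too_large = 3945, 1701
wrong_orientation = [184, 161, 243]   # F-, F+, R-
predominant_orientation = 2443273     # R+

unaccounted = both_unique - unique_paired
explained = too_small + too_large + sum(wrong_orientation)
print(unaccounted, explained)  # 6234 6234

# The two tables also partition their own totals.
assert sum(wrong_orientation) + predominant_orientation == both_unique
assert too_small + too_large + unique_paired == predominant_orientation
```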

Document generated by Confluence on Jul 25, 2008 16:42