This page last changed on Dec 11, 2007 by ac.
Purpose

This document describes the two new analysis modes in version 0.3 of the pipeline: ANALYSIS eland_extended and ANALYSIS eland_pair.

Executive Summary

There are two new modes - eland_extended and eland_pair.

  • ANALYSIS eland_extended is an improved version of the existing ANALYSIS eland mode for aligning single-read data against a target reference sequence.
  • ANALYSIS eland_pair is for aligning paired reads against a target reference sequence

The important data is contained in one file of each type per lane (or two files per lane in most cases for eland_pair mode). The output files users should be most interested in are:

  • s_N_export.txt - results of alignment of all reads in the lane. One line per read, reads ordered to match their ordering in the files produced by the basecaller. Fields tab separated to facilitate export to databases. This file has a line for every read - not just those that pass purity filtering - but the last field on each line is a flag telling you whether the read passed the filter or not.
  • s_N_sorted.txt - similar to s_N_export.txt, except there are only entries for reads which pass purity filtering and have a unique alignment in the reference. These are sorted by order of their alignment position, which is meant to facilitate the extraction of ranges of reads for purposes of e.g. visualization or SNP calling.
  • s_N_anomaly.txt (eland_pair only) - this contains one line for each read pair whose two reads did not align at the nominal distance and orientation relative to each other. This is the file to mine for structural variation information.

Also, both modes assign a quality score to each alignment, which is obtained by combining base quality information with the existence of similar alignments.

More details below.

Existing (pipeline release <=0.2.2.6) GERALD analysis modes

The version of GERALD in the v0.2.2.6 pipeline release may be run in the following four modes:

ANALYSIS default

Align to reference sequence using phageAlign.

ANALYSIS monotemplate

Align to set of tags with phageAlign.

ANALYSIS expression

Align to set of tags with phageAlign.

ANALYSIS eland

Align to (large) reference sequence using ELAND

Backwards Compatibility

For pipeline version 0.3 all these modes are unchanged - users should be able to carry on using them exactly as they did previously.

New (pipeline release >=0.3) GERALD analysis modes

The export.txt file

Both eland_extended and eland_pair share a common export.txt file format that contains all read, quality value and alignment information for a lane of data. However, an eland_pair analysis of a lane will produce two such files (named e.g. s_6_1_export.txt and s_6_2_export.txt), one for each of the two reads, whereas an eland_extended analysis will produce a single such file per lane (named e.g. s_6_export.txt). Moreover, not all the fields are relevant to a single-read analysis; those that are not will be left blank.

The format is meant to be easily exportable to databases.

  • The fields are tab-separated with a constant number of fields (22) per line.
  • The last field on a line is a Y/N flag telling you whether each read has passed filtering or not.
  • The ordering of sequences matches the ordering in the _seq.txt files, making it easier to potentially read in intensity information at a later stage.

Format of export.txt file

The number of fields per line remains a constant 22; any not relevant to a particular read are left blank (the empty string ""). In particular, for a single-read analysis the Read Number, Paired Read Alignment Score and Partner Chromosome/Contig/Offset/Strand fields will all be blank.

  1. Machine (as parsed from run folder name).
  2. Run Number (as parsed from run folder name).
  3. Lane.
  4. Tile.
  5. X Coordinate of cluster.
  6. Y Coordinate of cluster.
  7. Index String (blank for a non-indexed run).
  8. Read Number ("1" or "2" for paired read, blank for a single read).
  9. Read.
  10. Quality String - in symbolic ASCII format (ASCII character code = quality value + 64) by default. Set QUALITY_FORMAT --numeric in the GERALD config file to get numeric values instead.
  11. Match Chromosome - name of chromosome match was to OR code indicating why no match was done.
  12. Match Contig (blank if no match found) - gives contig name if there is a match and the match chromosome is split into contigs.
  13. Match Position (always with respect to forward strand, numbering starts at 1).
  14. Match Strand ("F" for forward or "R" for reverse, blank if no match).
  15. Match Descriptor - concise description of alignment. A numeral denotes a run of matching bases, a letter denotes substitution of a nucleotide, so e.g. for a 35 base read, "35" denotes an exact match and "32C2" denotes substitution of a "C" at the 33rd position.
  16. Single Read Alignment Score - alignment score of single read match (if a paired read, gives alignment score of read if it were to be treated as a single read).
  17. Paired Read Alignment Score - alignment score of read pair (alignment score of a paired read and its partner, taken as a pair. Blank for a single read run).
  18. Partner Chromosome - not blank only if read is paired and its partner aligns to another chromosome, in which case it gives the name of the chromosome.
  19. Partner Contig - not blank only if read is paired and its partner aligns to another chromosome and that partner is split into contigs.
  20. Partner Offset - if a paired read's partner hits to the same chromosome (as it will in the vast majority of cases) and contig (if the chromosome is split into contigs) then this number added to Match Position gives the alignment position of the read's partner.
  21. Partner Strand - which strand the paired read's partner hit ("F" for forward or "R" for reverse, blank if no match).
  22. Filtering. Did the read pass quality filtering? "Y" for yes, "N" for no.
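
The field list above maps naturally onto a small record parser. The following is a minimal sketch, assuming lines are exactly as described (22 tab-separated fields, blanks as empty strings). The names in EXPORT_FIELDS and the function parse_export_line are illustrative labels chosen here, not identifiers defined by the pipeline.

```python
# Illustrative labels for the 22 columns listed above (hypothetical names,
# not defined by the pipeline itself).
EXPORT_FIELDS = [
    "machine", "run_number", "lane", "tile", "x", "y",
    "index_string", "read_number", "read", "quality_string",
    "match_chromosome", "match_contig", "match_position", "match_strand",
    "match_descriptor", "single_read_score", "paired_read_score",
    "partner_chromosome", "partner_contig", "partner_offset",
    "partner_strand", "filtering",
]


def parse_export_line(line):
    """Split one tab-separated export.txt line into a dict keyed by field name."""
    values = line.rstrip("\r\n").split("\t")
    if len(values) != len(EXPORT_FIELDS):
        raise ValueError(
            "expected %d fields, got %d" % (len(EXPORT_FIELDS), len(values))
        )
    record = dict(zip(EXPORT_FIELDS, values))
    # Blank fields stay as empty strings, mirroring the file format.
    record["passed_filter"] = record["filtering"] == "Y"
    return record
```

Values are kept as strings because several fields (for example Match Position) are legitimately blank for reads without a match.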

Multiple-match ELAND

Both new modes are built around the new multiple match ELAND output obtained by specifying --multi on the command line for ELAND.

Example of raw ELAND output

Just for interest, here are examples of the same data aligned using "standard" and multiple-match ELAND. Note carefully that both eland_pair and eland_extended treat the multiple-match ELAND output as raw data and apply considerable post-processing to it. For export into a user's own application other files - most notably the export.txt format - should be preferred over the "raw" ELAND output, in particular because the raw ELAND output has not had signal purity filtering applied.

Standard ELAND format (separating tabs changed to spaces for clarity)

>SLXA-B3_604:6:1:533:275 GATTATTCACGATGGGGAACTGGTTTTTGATTTTTT U0 1 0 0 BAC_plus_vector.fa 153585 R ..
>SLXA-B3_604:6:1:562:335 TTTCGGTGTTTTCATTTTTTTTTCCTTTGTTAAATC U0 1 0 0 BAC_plus_vector.fa 68238 R ..
>SLXA-B3_604:6:1:613:330 AATAATCAGACCGACGATACTAGTGGGACCGTGGTC R1 0 6 1

ELAND --multi format

>SLXA-B3_604:6:1:533:275 GATTATTCACGATGGGGAACTGGTTTTTGATTTTTT 1:0:0 BAC_plus_vector.fa:153585R0
>SLXA-B3_604:6:1:562:335 TTTCGGTGTTTTCATTTTTTTTTCCTTTGTTAAATC 1:0:0 BAC_plus_vector.fa:68238R0
>SLXA-B3_604:6:1:613:330 AATAATCAGACCGACGATACTAGTGGGACCGTGGTC 0:6:1
BAC_plus_vector.fa:168839R1,168882R1,168968R1,169011R2,169054R1,169097R1,169140R1

The read ID

Each read is given a unique identifier string, which is used to identify the read in most of the intermediate files mentioned below.

>SLXA-B3_604:6:1:533:275 

denotes an unpaired read. It is from lane 6 tile 1 of run 604 on machine SLXA-B3, and the (X,Y) coordinates of the cluster on the tile (in pixel units) are (533,275). Taken together these data are sufficient to uniquely identify a cluster - importantly for multi-run projects, this uniqueness should extend across runs, although this is of course reliant upon machine naming and run numbering being done in a sensible way. Furthermore

>SLXA-B3_604:6:1:533:275/1 

would denote read 1 of a paired read and

>SLXA-B3_604:6:1:533:275/2 

denotes read 2 of a paired read.

Note that for the final export.txt format the read ID is split into tab separated fields.

Handling indexing

To support possible future applications, pipeline version 0.3 allows a subset of bases in the read to be specified as index bases (otherwise known as a bar code). These bases are removed from the read and incorporated into the read ID; they play no other part in the analysis. However, having the index string embedded in the read ID in this way gives the potential for reads and/or alignments to be easily separated by index for the purposes of e.g. SNP calling or assembly. Index bases are specified by an extension to the syntax for the USE_BASES parameter in the GERALD config file - see the user guide for details.

Single read example:
>SLXA-B3_604:6:10:838:622#TCA
Paired read example:
>SLXA-B3_604:6:10:838:622#TCA/1

Again note that for the final export.txt format the read ID is split into tab separated fields, of which the index string is one.
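
As an illustration of the read ID layout described above, here is a small sketch that splits an ID into its parts, including the optional index string and read number. The regular expression and the function name parse_read_id are assumptions made for this example. In particular, it assumes the run number is whatever follows the last underscore before the first colon.

```python
import re

# Hypothetical pattern for IDs such as >SLXA-B3_604:6:10:838:622#TCA/1
READ_ID_RE = re.compile(
    r"^>?(?P<machine>.+)_(?P<run>\d+)"
    r":(?P<lane>\d+):(?P<tile>\d+):(?P<x>\d+):(?P<y>\d+)"
    r"(?:#(?P<index>[^/]+))?"      # optional index string after '#'
    r"(?:/(?P<read>[12]))?$"       # optional read number after '/'
)


def parse_read_id(read_id):
    """Return the components of a read ID as a dict."""
    match = READ_ID_RE.match(read_id.strip())
    if match is None:
        raise ValueError("unrecognised read ID: %r" % read_id)
    return {
        "machine": match.group("machine"),
        "run": int(match.group("run")),
        "lane": int(match.group("lane")),
        "tile": int(match.group("tile")),
        "x": int(match.group("x")),
        "y": int(match.group("y")),
        "index": match.group("index"),  # None for a non-indexed run
        "read": int(match.group("read")) if match.group("read") else None,
    }
```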

ANALYSIS eland_extended

This is an upgraded version of the 'ANALYSIS eland' mode. From pipeline release 0.2.2 onwards 'ANALYSIS eland' supports >32 base reads in a manner that, owing to time pressure, is somewhat rudimentary. The idea of eland_extended was to improve this support and also to prototype ideas and file formats for ANALYSIS eland_pair (see below).

The intention for pipeline 0.3 is to leave 'ANALYSIS eland' exactly as is for users who have built its file formats etc. into their infrastructure, but encourage users to migrate to 'ANALYSIS eland_extended.'

'ANALYSIS eland' can align reads longer than 32 bases but demands that the first 32 bases of the read have a unique best match in the genome. The position of this match is used as a 'seed' to extend the match along the full length of the read. ANALYSIS eland_extended removes the uniqueness restriction by allowing multiple 32-mer matches to be considered and extended. The length of the initial 'seed' alignment can also be varied from the default 32 bases.

Configuring ANALYSIS eland_extended

There are two parameters that affect the output of the alignment; both can be specified per lane.

  • ELAND_SEED_LENGTH
    By default the first 32 bases of the read are used as a 'seed' alignment. Setting ELAND_SEED_LENGTH to (say) 25 will use instead 25-mers for the initial seed alignment. This should increase the sensitivity since two errors per 25-mer is less stringent than two errors per 32-mer. However a read is more likely to be repetitive at the 25-mer level than at the 32-mer level, so a decrease in ELAND_SEED_LENGTH should probably be used in conjunction with an increase in ELAND_MAX_MATCHES. Setting this to very low values will drastically slow down the alignment time and will probably result in a lot of poor confidence alignments.
  • ELAND_MAX_MATCHES
    By default eland_extended will consider at most 10 alignments of each read. ELAND_MAX_MATCHES allows this to be varied between 1 and 255.

Processing steps and file formats

ANALYSIS eland_extended is built around the 'eland --multi' output above. Processing steps are as follows.

0. s_N_qraw.txt

Contains an ASCII string for each read encoding the quality values for the called bases for that read. Encoding is ASCII code = qv + 64, so e.g. ASCII 94='^' codes for Q30.

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^W^H^^^^
^^^^^^^^^^^^^^^^^^^^^^^^X^S^^^[^^^RA
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
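
The encoding above can be decoded with a one-line helper. This is a sketch only; the function name decode_quality is chosen for illustration.

```python
def decode_quality(quality_string, offset=64):
    """Convert a symbolic quality string into a list of integer quality values.

    Uses the encoding described above: ASCII code = quality value + 64.
    """
    return [ord(symbol) - offset for symbol in quality_string]
```

For instance, '^' (ASCII 94) decodes to 30, as stated above.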

1. s_N_eland_query.txt.

All reads for the lane are concatenated into a single fasta file to use as an ELAND query.

>SLXA-B3_604:6:1:533:275
GATTATTCACGATGGGGAACTGGTTTTTGATTTTTT
>SLXA-B3_604:6:1:562:335
TTTCGGTGTTTTCATTTTTTTTTCCTTTGTTAAATC
>SLXA-B3_604:6:1:613:330
AATAATCAGACCGACGATACTAGTGGGACCGTGGTC

2. s_N_eland_multi.txt.

The output of 'eland --multi' for lane N.

>SLXA-B3_604:6:1:533:275 GATTATTCACGATGGGGAACTGGTTTTTGATTTTTT 1:0:0 BAC_plus_vector.fa:153585R0
>SLXA-B3_604:6:1:562:335 TTTCGGTGTTTTCATTTTTTTTTCCTTTGTTAAATC 1:0:0 BAC_plus_vector.fa:68238R0
>SLXA-B3_604:6:1:613:330 AATAATCAGACCGACGATACTAGTGGGACCGTGGTC 0:6:1
BAC_plus_vector.fa:168839R1,168882R1,168968R1,169011R2,169054R1,169097R1,169140R1

NB: When aligning >32 base reads, at most 32 bases are used for the alignment, so the start positions of alignments to the reverse strand must be corrected by subtracting (read length - 32) from the start position of the alignment. This is corrected for at a later stage, but the alignment positions in this file have not been corrected, so you must be aware of this if you are using this file directly.
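
The correction described above can be sketched as follows. The function name is hypothetical. Generalising the constant 32 to a seed_length parameter is an assumption made here, since the source only states the rule for the default of 32.

```python
def correct_reverse_position(position, strand, read_length, seed_length=32):
    """Correct an uncorrected s_N_eland_multi.txt position for the full read.

    Reverse-strand hits are shifted by (read_length - seed_length);
    forward-strand hits are returned unchanged.
    """
    if strand == "R" and read_length > seed_length:
        return position - (read_length - seed_length)
    return position
```

With the 36-base reads in the examples, the reverse-strand hit at 153585 becomes 153581, which matches the position shown in s_N_eland_extended.txt.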

3. s_N_frag.txt.

This takes the alignment positions (based on at most 32 bases) and does an alignment of the full (possibly >32 base) read to each position. A numeral refers to a run of matching bases, an upper case base or N refers to a base in the reference that differs from the read.

36
34G1
20G15:20G15:20G15:20G9A5:20G15:20G14T:20G15

For example, in the above the first read is a 36-base exact match, but we need to substitute a G into the second read at position 35 (i.e. after 34 matching bases) to match the genome exactly. The third read has multiple matches, most of which have a single G substitution error at position 21.
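
To make the descriptor notation concrete, here is a minimal sketch of a decoder that returns the implied read length and the 1-based positions of the substitutions. The name parse_descriptor is illustrative, and only the implemented alphabet (numerals plus upper case A, C, G, T, N) is handled.

```python
import re

# A token is either a run length (digits) or a single substituted base.
TOKEN_RE = re.compile(r"(\d+)|([ACGTN])")


def parse_descriptor(descriptor):
    """Return (read_length, [(position, base), ...]) for a match descriptor."""
    position = 0
    consumed = 0
    substitutions = []
    for token in TOKEN_RE.finditer(descriptor):
        if token.start() != consumed:
            raise ValueError("unexpected character in %r" % descriptor)
        consumed = token.end()
        if token.group(1):
            position += int(token.group(1))   # run of matching bases
        else:
            position += 1                     # one substituted base
            substitutions.append((position, token.group(2)))
    if consumed != len(descriptor):
        raise ValueError("unexpected character in %r" % descriptor)
    return position, substitutions
```

Applied to "34G1" this gives a read length of 36 with a single substitution of G at position 35, in agreement with the description above.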

4. s_N_eland_extended.txt

The information in s_N_eland_multi.txt and s_N_frag.txt is combined here. This corrects the reverse complement matches and incorporates a description of the alignment to the full read. Example (NB field separators are tabs, shown as spaces here for clarity)

>SLXA-B3_604:6:1:533:275 GATTATTCACGATGGGGAACTGGTTTTTGATTTTTT 1:0:0 BAC_plus_vector.fa:153581R36
>SLXA-B3_604:6:1:562:335 TTTCGGTGTTTTCATTTTTTTTTCCTTTGTTAAATC 1:0:0 BAC_plus_vector.fa:68234R34G1
>SLXA-B3_604:6:1:613:330 AATAATCAGACCGACGATACTAGTGGGACCGTGGTC 0:6:1
BAC_plus_vector.fa:168835R20G15,168878R20G15,168964R20G15,169007R20G9A5,169050R20G15,169093R20G14T,169136R20G15

A read with hits to multiple chromosomes looks like this:

>SLXA-B3_604:6:15:816:354       ACGGTTGAGTAATAAATGGATGCACTGCCTAACCGG    0:0:2  
BAC_plus_vector.fa:163013R23C4G3G3,E_coli.fa:3909838R23C4G3G3

Special cases:

>SLXA-B3_604:6:1:513:360 GAGACACCACCCCCCACCCCCCACCACACTCCCTCC NM -
>SLXA-B3_604:6:1:762:25 NNNNNNNNNNNNNNNNCNNCNNNNGANGTAANNNNA QC -
>SLXA-B3_604:6:1:853:100 ATTTGGGAGGCAAAGGCGGGCTGATTAAGAGGTGAG RM -
>SLXA-B3_604:6:1:513:922 CGGATGCGGCGTAAACGCCTTATCCGGCCCACATCA 0:35:75 -
  • NM: no match found
  • QC: no alignment done - too many "N"s
  • RM: read was repeat masked - only seen if repeat masking feature of ELAND is switched on
  • x:y:z - number of alignments exceeds threshold, no alignment information stored. x, y and z are the number of exact, single-error and double-error matches respectively
    The eland_extended.txt file has the corrected alignment positions and the full alignment descriptions for >32 base reads so is a better choice for use in your own analysis than the s_N_eland_multi.txt file, especially if one is interested in multiple hits per read. However one should be aware that this file is not purity filtered. If a single alignment per read is fine the s_N_export.txt file is the preferred choice.
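
For readers who want to use the multiple hits directly, the match field can be unpacked roughly as follows. This is a sketch under two assumptions drawn from the examples above: a hit without a "file:" prefix belongs to the most recently named file, and the file name itself contains no colon after its last one. The function names are hypothetical.

```python
import re

# One hit looks like 168835R20G15: position, strand, then match descriptor.
HIT_RE = re.compile(r"^(\d+)([FR])(.*)$")


def parse_match_field(field):
    """Return a list of (chromosome, position, strand, descriptor) tuples."""
    hits = []
    if field in ("", "-"):
        return hits                      # no alignment information stored
    chromosome = None
    for entry in field.split(","):
        prefix, separator, rest = entry.rpartition(":")
        if separator:
            chromosome = prefix          # a new target file is named
        hit = HIT_RE.match(rest)
        if hit is None or chromosome is None:
            raise ValueError("cannot parse hit %r" % entry)
        hits.append((chromosome, int(hit.group(1)), hit.group(2), hit.group(3)))
    return hits


def parse_match_counts(code):
    """Return (exact, one_error, two_error) for an x:y:z code, else None."""
    parts = code.split(":")
    if len(parts) == 3 and all(part.isdigit() for part in parts):
        return tuple(int(part) for part in parts)
    return None                          # NM, QC, RM or anything else
```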

5. s_N_saf.txt

SAF stands for 'Short Alignment Format.' This file aims to succinctly describe the best alignment for each read (if there is one). The raw quality values are used to attempt to pick the best alignment from the potentially multiple possibilities.

>SLXA-B3_604:6:1:533:275 BAC_plus_vector.fa 153581 R 36 94
>SLXA-B3_604:6:1:562:335 BAC_plus_vector.fa 68234 R 34G1 60
>SLXA-B3_604:6:1:613:330 BAC_plus_vector.fa 168835 R 20G15 0

Special case

>SLXA-B3_604:6:19:365:135 NM

Non-aligning reads may have other codes (QC, RM and x:y:z) as per the description of s_N_eland_extended.txt.
The (tab-separated) fields are as follows:

  1. Read ID. This consists of Run ID:Lane:Tile:X:Y
  2. Chromosome
    Handling multiple-entry fasta files
    As of Version 0.3 ELAND copes with fasta files containing multiple entries (therefore so will the pipeline). These get designated chrom1/contig33 and similar.
    Example:
    ec_plus_bac.fa/bCX17K10_79963..162948__bCX98J2_1..81768
    It is best to keep contig names small and avoid exotic ASCII characters within them
  3. Position: always wrt forward strand, starting at 1
  4. Strand: F or R
  5. Alignment descriptor: a number denotes a run of bases same as read; A,C,G,T denote substitution of the relevant base.
    Future Proofing
    (Not implemented: a,c,g,t denote insertion, d denotes deletion.)
  6. Alignment QV: a probability of the alignment being wrong - expressed in "Phred style" QVs, ie Q20=1% chance of alignment being wrong. This uses the quality values and the position and type of neighbouring matches.
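
The "Phred style" scale mentioned for the alignment QV is the usual Q = -10 log10(p). A small sketch of the conversion in both directions (function names are illustrative):

```python
import math


def qv_to_error_probability(qv):
    """Probability that the alignment is wrong, given a Phred-style QV."""
    return 10 ** (-qv / 10)


def error_probability_to_qv(probability):
    """Phred-style QV for a given probability of the alignment being wrong."""
    return -10 * math.log10(probability)
```

So Q20 corresponds to a 1% chance of the alignment being wrong, and a QV of 0 corresponds to a probability of 1, i.e. no confidence in the chosen alignment.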

6. s_N_qcal.txt

This file contains quality values for each base in exactly the same format as s_N_qraw.txt, the difference being that these quality values have been recalibrated using a calibration table derived from the alignments. The algorithm is similar to the Phred algorithm from capillary sequencing

7. s_N_calsaf.txt

CALSAF stands for 'CALibrated Short Alignment Format.' The format is identical to s_N_saf.txt, the difference being that the calibrated quality values are used to pick the best alignment.

8. s_N_export.txt

This file is generated from s_N_calsaf.txt by

  • pasting in the read and calibrated quality values
  • appending Y or N, Y means the read has passed quality filtering
    Example:
    >SLXA-B3_604:6:3:131:386 TTCATTTTCTTACTTTCCGTGAACCTTAAATGAATT ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZQMMMM
    ec_plus_bac.fa/bCX17K10 2 R 36 73 Y
    >SLXA-B3_604:6:12:812:119 ATTCATTTAAGGTTCACGGAAAGTAAGAAAATGAAA ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZQHMMM
    ec_plus_bac.fa/bCX17K10 3 F 36 73 Y
    >SLXA-B3_604:6:7:174:255 ATTCATTTAAGGTTCACGGAAAGTAAGAAAATGAAA ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZQMMMM
    ec_plus_bac.fa/bCX17K10 3 F 36 73 Y
    

9. s_N_sorted.txt

All the files mentioned up until now contain one line per cluster, sorted in the standard order: by tile, then by position in tile. No filtering has yet been done. The s_N_sorted.txt file is obtained from s_N_saf.txt by

  1. Prepending readID, sequence and quality string
  2. Applying the quality filter
  3. Removing all NM,QC,RM reads
  4. Sorting by chromosome then position in chromosome
    Example
>SLXA-B3_604:6:10:221:356 CTTGGAGAATGTTCCATGTGCTAATGAATAGAATGT ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ZXSRNNNN
BAC_plus_vector.fa 11655 R 36 65
>SLXA-B3_604:6:5:107:178 ATTGGAGAATGTTCCATGAGCTAATGAATAGAATGT ^^L^^C^^^^^^^^^^^^DI^^^^K^^^ZXSRNNIN
BAC_plus_vector.fa 11655 R C17T17 0
>SLXA-B3_604:6:10:810:361 CATTCTATTCATTAGCACATGGAACATTCTCCAAGA ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ZXSRNNNN
BAC_plus_vector.fa 11656 F 36 65

Note

There is the potential to increase the sensitivity (at the expense of file size and run time) by

  1. allowing more multiple matches to be stored (at the moment a maximum of 10 is hard-coded into Eland)
  2. decreasing the length of the initial ELAND alignment - the default of 32 means at the moment there can be at most 2 errors in the first 32 bases. Setting ELAND_SEED_LENGTH to 25 (say) allows a less stringent two errors in the first 25 bases. Making the seed length very small will start to have a detrimental effect on the alignment speed, also this feature has the potential to allow users to play with ELAND_SEED_LENGTH until they get answers they like the look of, which I am somewhat ambivalent about.

ANALYSIS eland_pair

'ANALYSIS eland_pair' allows the analysis of a paired read run using ELAND alignments. This mode is based heavily on 'ANALYSIS eland_extended' - to start with it essentially does parallel 'eland_extended' analyses for the two reads. Files s_N_1_qraw.txt, s_N_1_eland_query.txt, s_N_1_eland_multi.txt, s_N_1_frag.txt and s_N_1_eland_extended.txt are produced for read 1; these have identical format and function to their single-read homologues. Similarly for read 2.

A script pickBestPair.pl now compares s_N_1_eland_extended.txt and s_N_2_eland_extended.txt along with their quality values and produces two files of alignments s_N_1_saf.txt and s_N_2_saf.txt, which contain pairing information. These are then used to do quality value recalibration, generating calibrated quality values s_N_1_qcal.txt and s_N_2_qcal.txt. pickBestPair.pl is then re-run using the calibrated quality values to obtain two further files of alignments s_N_1_calsaf.txt and s_N_2_calsaf.txt. Finally these files are parsed into files s_N_1_export.txt, s_N_2_export.txt, s_N_1_sorted.txt and s_N_2_sorted.txt.

Specifying USE_BASES for paired reads

The format for the USE_BASES string has changed for pipeline version 0.3, to hopefully make it easier to use and also to accommodate paired reads - see the main GERALD user guide for further details.

Configuring a paired read analysis

The alignments of the two reads that provide input to the pairing process may be varied by setting ELAND_SEED_LENGTH and ELAND_MAX_MATCHES. Both parameters may be set lane-by-lane, but the same values will apply to each of the two reads in a lane.

The paired read analysis may be configured by passing options to pickBestPair. This is done by setting a parameter PAIR_PARAMS in the GERALD config file.

  • --circular: This causes pickBestPair to treat each chromosome as circular and not linear, enabling it to detect valid pairings that 'wrap around' when the two alignments are mapped onto the linear representation of the chromosome.
    Optional:
    --circular=my_mitochondria_file.fa - treat alignments to my_mitochondria_file.fa as circular but other chromosomes as linear (as you might want to do when e.g. aligning to the whole human genome)
    --circular=chromosome1:100000,chromosome2:300000 - specify chromosomes to circularize and also the size to 'wrap around' (possibly of use when the chromosome size is uncertain)
  • --min-percent-unique-pairs: In this context a "unique pair" is defined as a read pair such that its constituent reads can each be aligned to a unique position in the genome without needing to make use of the fact that they are paired - "individually uniquely aligning" is perhaps a more precise but more verbose definition. pickBestPair works in a two-pass fashion. On the first pass it looks for all clusters that both pass the quality filter and are unique by this definition, using this information to determine the nominal insert size distribution and the relative orientation of the two reads. On a second pass this information is used to resolve repeats and other ambiguous cases.

The number of unique pairs, expressed as a percentage of the total number of clusters passing filters, must exceed a certain percentage, else no pairing is attempted and the two reads are effectively treated as two sets of single reads.
By default, this threshold is set to 30%, but for low quality data a pairing can be forced by setting --min-percent-unique-pairs=5 or somesuch. For some applications it may be useful to switch off the pairing completely, in which case set --min-percent-unique-pairs=101

  • --min-percent-consistent-pairs: Of the unique pairs (i.e. read pairs such that their constituent reads can each be aligned to a unique position in the genome without needing to make use of the fact that they are paired), the vast majority of them should have the same orientation with respect to each other. If they don't, it is indicative of either problems with sample prep, circularization not being switched on when it should be, or a reference sequence that is extremely diverged from the sample data. In such cases no pairing is attempted and the two reads are effectively treated as two sets of single reads. By default, the threshold for this is set to 70%.
  • --min-paired-read-alignment-score: for each cluster, all possible pairings of alignments between the two reads are compared. This is the score of the best one. Since we are effectively considering the two reads as one, it is logical that both reads in a cluster get the same paired read alignment score. It is nominally on a Phred scale, however it is probably not safe to assume the calibration is perfect. Nevertheless it is a good discriminator between good and bad alignments. The score must exceed this threshold to go in the sorted.txt file - default value is zero.
  • --min-single-read-alignment-score: each read is given a single read alignment score. This is identical to the alignment score it would get from an eland_extended analysis. If a read has a zero paired read alignment score but yet a single read alignment score that exceeds this threshold, its alignment will still go in the sorted.txt files. If a cluster is such that the alignments of the two reads can't be paired (hence giving a zero paired score) and only one of the reads has an alignment exceeding --min-single-read-alignment-score, the read pair is treated as a singleton - ie the alignment of the shadow read is unreliable enough to be ignored.
  • --add-shadow-to-singleton-threshold=1000000
    If a cluster is such that one read has a score exceeding --min-single-read-alignment-score but the second either has no alignments or an alignment that does not exceed --min-single-read-alignment-score, then the non-aligning "shadow" read is added to the sorted.txt file with a zero alignment score if the mean base quality of the shadow read (not alignment quality - there's no alignment, remember) exceeds this threshold.

The default value of 1000000 effectively means this feature is switched off by default.

PAIR_PARAMS can of course be set on a lane-by-lane basis. Note however that all the options must be specified space-separated on a single line, e.g. you must do

8:PAIR_PARAMS --circular --min-percent-unique-pairs=30

and not anything like

8:PAIR_PARAMS --circular
8:PAIR_PARAMS --min-percent-unique-pairs=30

Alignment statistics produced by pickBestPair

pickBestPair dumps paired read alignment stats into a file s_N_pair.xml that is produced for each lane of eland_pair analysis. Most of the useful information therein goes into tables in the Summary.htm web pages, but a description of the XML file itself is included here for the benefit of developers.

<ReadPairProperties>

A dump of the parameters used to do the pairing.

  <ControlParametersUsed>
    <add-shadow-to-singleton-threshold>1000000</add-shadow-to-singleton-threshold>
    <circular>N</circular>
    <min-paired-read-alignment-score>0</min-paired-read-alignment-score>
    <min-percent-consistent-pairs>70</min-percent-consistent-pairs>
    <min-percent-unique-pairs>8</min-percent-unique-pairs>
    <min-single-read-alignment-score>0</min-single-read-alignment-score>
  </ControlParametersUsed>

Total length of reads, and the subsets of the reads used for the ELAND alignments

  <Length>
    <Read2>
      <Total>38</Total>
      <SeedLengthForELAND>32</SeedLengthForELAND>
    </Read2>
    <Read1>
      <Total>35</Total>
      <SeedLengthForELAND>32</SeedLengthForELAND>
    </Read1>
  </Length>

Only reads with insert sizes between Min and Max inclusive are considered nominal.

  <InsertSize>
    <HighSD>11</HighSD>
    <LowSD>14</LowSD>
    <Max>234</Max>
    <Median>201</Median>
    <Min>159</Min>
  </InsertSize>

Pairs in which each read has a unique alignment are classified according to the relative orientation of the two reads. NominalOrientationPercent must exceed --min-percent-consistent-pairs for pairing to go ahead.

  <Orientation>
    <Fm>2345</Fm>
    <Fp>2199</Fp>
    <Nominal>Rp</Nominal>
    <NominalOrientationButLargeInsert>15383</NominalOrientationButLargeInsert>
    <NominalOrientationButLargeInsertPercent>0.43385</NominalOrientationButLargeInsertPercent>
    <NominalOrientationButSmallInsert>52235</NominalOrientationButSmallInsert>
    <NominalOrientationButSmallInsertPercent>1.47318</NominalOrientationButSmallInsertPercent>
    <NominalOrientationPercent>99.74414</NominalOrientationPercent>
    <Rm>4528</Rm>
    <Rp>3536661</Rp>
  </Orientation>

This gives stats on the unique pairs used to calculate the insert distance. InitialUniquePairsPercent must exceed --min-percent-unique-pairs for pairing to go ahead.

  <Pairs>
    <ClustersPassedFiltering>5424640</ClustersPassedFiltering>
    <ClustersTotal>11885647</ClustersTotal>
    <ClustersUsedToComputeInsert>3545733</ClustersUsedToComputeInsert>
    <InitialUniquePairsPercent>29.83206</InitialUniquePairsPercent>
  </Pairs>

This table breaks down the fate of each cluster according to the fate of its two constituent reads

  <Reads>
    <Read1SingleAlignmentFound>
      <Read2ManyAlignmentsFound>
        <BothAlignButNoFeasiblePair>
          <BothAlignmentsOK>6812</BothAlignmentsOK>
        </BothAlignButNoFeasiblePair>
        <ManyPairedAlignments>3576</ManyPairedAlignments>
        <UniquePairedAlignment>163549</UniquePairedAlignment>
      </Read2ManyAlignmentsFound>
      <Read2NM>
        <NoMatchToEither>1</NoMatchToEither>
        <SingletonRead1>
          <AlignmentOK>748788</AlignmentOK>
        </SingletonRead1>
      </Read2NM>
      <Read2QC>
        <SingletonRead1>
          <AlignmentOK>27343</AlignmentOK>
        </SingletonRead1>
      </Read2QC>
      <Read2Repeat>
        <SingletonRead1>
          <AlignmentOK>2254</AlignmentOK>
        </SingletonRead1>
      </Read2Repeat>
      <Read2SingleAlignmentFound>
        <BothAlignButNoFeasiblePair>
          <BothAlignmentsOK>76669</BothAlignmentsOK>
        </BothAlignButNoFeasiblePair>
        <SingletonRead1>
          <AlignmentOK>16</AlignmentOK>
        </SingletonRead1>
        <SingletonRead2>
          <AlignmentOK>5</AlignmentOK>
        </SingletonRead2>
        <UniquePairedAlignment>3469043</UniquePairedAlignment>
      </Read2SingleAlignmentFound>
    </Read1SingleAlignmentFound>
  <Read1ManyAlignmentsFound>
  ...
  </Read1ManyAlignmentsFound>
  <Read1NM>
  ...
  </Read1NM>
  <Read1QC>
  ...
  </Read1QC>
  <Read1Repeat>
  ...
  </Read1Repeat>
  </Reads>
</ReadPairProperties>
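
For developers reading s_N_pair.xml, the two threshold checks described above can be sketched with the Python standard library. The SAMPLE string is a cut-down copy of the example values shown above, and the function name pairing_checks is hypothetical. The recomputed orientation percentage (count of the nominal orientation divided by the sum of the Fm, Fp, Rm and Rp counts) is an observation that happens to reproduce the example figure; the source does not state that this is how the value is defined.

```python
import xml.etree.ElementTree as ET

# Cut-down copy of the example values shown above.
SAMPLE = """<ReadPairProperties>
  <ControlParametersUsed>
    <min-percent-consistent-pairs>70</min-percent-consistent-pairs>
    <min-percent-unique-pairs>8</min-percent-unique-pairs>
  </ControlParametersUsed>
  <Orientation>
    <Fm>2345</Fm>
    <Fp>2199</Fp>
    <Nominal>Rp</Nominal>
    <NominalOrientationPercent>99.74414</NominalOrientationPercent>
    <Rm>4528</Rm>
    <Rp>3536661</Rp>
  </Orientation>
  <Pairs>
    <InitialUniquePairsPercent>29.83206</InitialUniquePairsPercent>
  </Pairs>
</ReadPairProperties>"""


def pairing_checks(xml_text):
    """Report whether the two pairing thresholds were exceeded."""
    root = ET.fromstring(xml_text)

    def number(path):
        return float(root.findtext(path))

    unique = number("Pairs/InitialUniquePairsPercent")
    consistent = number("Orientation/NominalOrientationPercent")
    counts = {name: number("Orientation/" + name)
              for name in ("Fm", "Fp", "Rm", "Rp")}
    nominal = root.findtext("Orientation/Nominal")
    return {
        "unique_ok": unique > number(
            "ControlParametersUsed/min-percent-unique-pairs"),
        "orientation_ok": consistent > number(
            "ControlParametersUsed/min-percent-consistent-pairs"),
        # Observation only: reproduces the reported percentage in the example.
        "recomputed_orientation_percent":
            100.0 * counts[nominal] / sum(counts.values()),
    }
```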

File format descriptions

The script pickBestPair.pl compares s_N_1_eland_extended.txt and s_N_2_eland_extended.txt along with their quality values and produces two files of alignments s_N_1_saf.txt and s_N_2_saf.txt, which are similar in format to their single-read s_N_saf.txt homologue but have three extra fields. Example:

>SLXA-B3_604:4:1:363:727 BAC_plus_vector.fa 40840 F 36 94 94 R 68
>SLXA-B3_604:4:1:434:81 BAC_plus_vector.fa 46936 F 36 94 94 R 55
>SLXA-B3_604:4:1:582:447 BAC_plus_vector.fa 16075 R 36 94 94 F 57
  1. Read ID. This consists of Run ID:Lane:Tile:X:Y-N, where N = 1 for read 1 and N = 2 for read 2
  2. Chromosome
    Allowing multiple-entry fasta files as alignment targets
    From pipeline version 0.3 ELAND and the pipeline allow fasta files with multiple entries to be 'squashed' as alignment targets. Matches to sequence named 'contig33' in file 'chrom1.fa' get designated chrom1.fa/contig33 and similar
  3. Position: always wrt forward strand, starting at 1
  4. Strand: F or R
  5. Alignment descriptor: a number denotes a run of bases same as read; A,C,G,T denote substitution of the relevant base
    Future Proofing
    (Not implemented: a,c,g,t denote insertion, d denotes deletion.)
  6. Q s: a probability of the alignment being wrong - expressed in "Phred style" QVs, ie Q20=1% chance of alignment being wrong. This uses the quality values and the position and type of neighbouring matches. This is the single read alignment score, ie ignoring pairing information
  7. Q p: Paired read alignment q-score.
  8. S p: Strand of paired match
  9. off: Distance to paired match. Main point of including this is to allow tracking to mate pair in a sorted file
    Read pairing

    At the moment it is nominal for reads to 'point to each other.' It is possible this may change for other sample preps. pickBestPair.pl has a guess at the nominal orientation of the reads and also computes stats on the distribution of insert sizes - it uses this information to decide which pairs are anomalous. It also places this information in a file s_N_pair.xml.
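
The nine-field paired format can be read as in the sketch below, which also uses the offset field to locate the partner read. Field labels and the function name are illustrative. The sketch only handles fully aligned nine-field lines and raises an error otherwise, so lines for non-aligning reads would need separate handling.

```python
# Illustrative labels for the nine tab-separated fields listed above.
PAIRED_SAF_FIELDS = [
    "read_id", "chromosome", "position", "strand", "descriptor",
    "single_read_score", "paired_read_score", "partner_strand", "offset",
]


def parse_paired_saf_line(line):
    """Parse one aligned line of s_N_1_saf.txt or s_N_2_saf.txt."""
    values = line.rstrip("\r\n").split("\t")
    if len(values) != len(PAIRED_SAF_FIELDS):
        raise ValueError("expected 9 fields, got %d" % len(values))
    record = dict(zip(PAIRED_SAF_FIELDS, values))
    for name in ("position", "single_read_score", "paired_read_score", "offset"):
        record[name] = int(record[name])
    # The offset added to the read's own position gives the partner's position.
    record["partner_position"] = record["position"] + record["offset"]
    return record
```

For the first example line above, the partner would be found on the reverse strand at 40840 + 68 = 40908.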

s_N_1_qcal.txt, s_N_2_qcal.txt

These are quality values obtained by calibrating the raw quality values using the alignments, as for s_N_qcal.txt in ANALYSIS eland_extended

s_N_1_calsaf.txt, s_N_2_calsaf.txt

These are identical in format to s_N_1_saf.txt and s_N_2_saf.txt, the difference being that the calibrated quality values are used to pick the best alignment.

s_N_1_export.txt, s_N_2_export.txt

These files are generated from s_N_1_calsaf.txt and s_N_2_calsaf.txt by

  • pasting in the read and calibrated quality values
  • appending Y or N, Y means the read has passed quality filtering
    Example:
    >SLXA-B3_604:6:3:131:386 TTCATTTTCTTACTTTCCGTGAACCTTAAATGAATT ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZQMMMM
    ec_plus_bac.fa/bCX17K10 2 R 36 73 Y
    >SLXA-B3_604:6:12:812:119 ATTCATTTAAGGTTCACGGAAAGTAAGAAAATGAAA ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZQHMMM
    ec_plus_bac.fa/bCX17K10 3 F 36 73 Y
    >SLXA-B3_604:6:7:174:255 ATTCATTTAAGGTTCACGGAAAGTAAGAAAATGAAA ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZQMMMM
    ec_plus_bac.fa/bCX17K10 3 F 36 73 Y
    

s_N_1_sorted.txt, s_N_2_sorted.txt

These are obtained from s_N_1_export.txt and s_N_2_export.txt in a manner identical to the way s_N_sorted.txt is obtained from s_N_export.txt in a single-read analysis, namely:

  1. Applying the quality filter
  2. Removing all NM,QC,RM reads
  3. Sorting by chromosome then position in chromosome
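The three steps above can be sketched in Python as follows. This is only an illustration, not the pipeline's own code: the function name sort_export_lines is hypothetical, and the sketch assumes the field order seen in the export example (read ID, read, quality string, target, position, ..., Y/N flag) and that an unaligned read carries NM, QC or RM in the target field.

```python
UNALIGNED_CODES = {"NM", "QC", "RM"}


def sort_export_lines(lines):
    """Filter export lines and sort them by target, then position."""
    kept = []
    for line in lines:
        fields = line.split()
        if len(fields) < 5:
            continue  # too short to hold the assumed layout
        if fields[-1] != "Y":
            continue  # step 1: read failed the quality filter
        if fields[3] in UNALIGNED_CODES:
            continue  # step 2: NM, QC or RM read (assumed to sit in field 4)
        if not fields[4].isdigit():
            continue  # no single numeric position to sort on
        kept.append((fields[3], int(fields[4]), line))
    # step 3: sort by chromosome (lexicographically), then by position
    kept.sort(key=lambda item: (item[0], item[1]))
    return [line for _, _, line in kept]
```

Sorting on a (target, position) tuple keeps all reads for one target together, which is what makes range extraction from the sorted file straightforward.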

Anomalous pairs are sent to a separate file

s_N_anomaly.txt

Reads with no decent pairing are sent here; both reads of a pair go in one file. This file is derived from a precursor s_N_anomraw.txt file, which is padded with 'OK' lines to match the line spacing and then purity filtered. The format of each line is

  1. Read ID
  2. Sequence of read 1
  3. Sequence of read 2
  4. Quality string of read 1
  5. Quality string of read 2
  6. All matches for read 1
  7. All matches for read 2
    The match information for the two reads is as per the s_N_1_eland_extended.txt and s_N_2_eland_extended.txt files. Example:
>EAS51_35:1:15:336:755  OK
>EAS51_35:1:15:386:460 TACCCTTCCCTGCAAACGATCTACG CACTGTCACCATACCCAGCTCGTGC 
hhhhhhhhhhhhhhSbhYNe_fIXJ hhhhhhhhhhhhShhhMWhNSCTG` E_coli.fa:4463732R25 E_coli.fa:2175996F25
>EAS51_35:1:15:795:876 AAACGCCTCAGCTTTGCTCCTGTTC CAGCCAGGTTTCCCGTTTCTGCCGG 
hQchhhQhhhhShhhhhhOBh``hh KMhhhhhchhFWh^hhhhhYJG^?W E_coli.fa:1303182R19A5 NM
>EAS51_35:1:15:753:733 ATACTAACAAAAAAAAGCAAAAAAA AAGGATGACACGGAACAGTGTAAGC 
hDhQhhhIV]N[NJMF[A\hCFhaO OhhJh[_hRhT\hdhCPHCUfEJN NM E_coli.fa:1990794F25
>EAS51_35:1:15:90:75 TNNNNNAATCNNNNNGATCACCGAC TGCTGGACTCGGGTCAGCCCAATTT 
J;;;;;Lf\A;;;;;AhNBhIA\DO hhhhhOhbXZhdh[NbBND@KJL QC NM

Line-by-line explanation:

  1. Read aligned with nominal separation between halves - hence no info in this file, it's all in the sorted.txt files
  2. Both halves aligned uniquely but with non-nominal separation and/or orientation
  3. First read aligned, no alignment found for second read
  4. No alignment found for first read, second read did align
  5. First read was not aligned as it contained too many Ns. An alignment was attempted for the second read but none was found
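The cases in the line-by-line explanation can be told apart mechanically. The Python sketch below is a hedged illustration: the names parse_match, count_mismatches and classify_anomaly_line are hypothetical, each record is assumed to be a single whitespace-separated line (the example above is wrapped for display), and a match field is assumed to hold either NM, QC, RM or a single match of the form target:position, strand, descriptor as in the example.

```python
import re

# e.g. "E_coli.fa:1303182R19A5" -> target, position, strand, descriptor
MATCH_RE = re.compile(
    r"^(?P<target>.+):(?P<pos>\d+)(?P<strand>[FR])(?P<desc>[0-9ACGT]+)$")
UNALIGNED_CODES = {"NM", "QC", "RM"}


def parse_match(text):
    """Return (target, position, strand, descriptor), or None if no match."""
    m = MATCH_RE.match(text)
    if m is None:
        return None
    return (m["target"], int(m["pos"]), m["strand"], m["desc"])


def count_mismatches(descriptor):
    """Each A, C, G or T in the descriptor marks one substituted base."""
    return sum(ch in "ACGT" for ch in descriptor)


def classify_anomaly_line(line):
    """Map one s_N_anomaly.txt line onto the cases listed above."""
    fields = line.split()
    if len(fields) == 2 and fields[1] == "OK":
        return "nominal pair"
    if len(fields) != 7:
        raise ValueError(f"expected 2 or 7 fields, got {len(fields)}")
    aligned1 = fields[5] not in UNALIGNED_CODES
    aligned2 = fields[6] not in UNALIGNED_CODES
    if aligned1 and aligned2:
        return "both aligned, non-nominal separation or orientation"
    if aligned1:
        return "read 1 aligned only"
    if aligned2:
        return "read 2 aligned only"
    return "neither read aligned"
```

With these definitions, parse_match("E_coli.fa:1303182R19A5") gives ("E_coli.fa", 1303182, "R", "19A5"), and count_mismatches("19A5") reports one substitution.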

Algorithm

The tricky thing with paired reads is to account for, and behave appropriately with respect to, all possible combinations of match scenarios for the two reads of a pair. The diagram below attempts to summarize the decision process for aligning a paired read.

  • As discussed above, pickBestPair will not attempt to do read pairing if either or both of the thresholds --min-percent-unique-pairs or --min-percent-consistent-pairs is not exceeded, because then it judges it is not able to estimate the insert size distribution with confidence. Setting these thresholds to unattainably high values (i.e. greater than 100) is a way of switching off the read-pairing algorithm, if that is desired.
  • For matches where only one half of the pair matches (the matching half is called a 'singleton', its non-aligning counterpart is called a 'shadow'), it is useful for some applications (e.g. reassembly) to add a 'null alignment' of the shadow read into the export.txt files.
    One would wish to exclude shadow reads that did not align purely because of poor base quality, so only reads whose mean base quality exceeds --add-shadow-to-singleton-threshold are added to the build. The default value of 1000000 ensures that this feature is switched off by default - we find it useful but recognise the potential for confusion to users.
    Any 'null alignments' of shadows may be recognised by the fact they get a zero single-read alignment score, an orientation of 'N' and a zero offset. A shadow is given the same alignment position as its counterpart singleton alignment, this ensures that a range query on alignment position will bring back both reads.
  • If the software has successfully aligned both halves of a read as a pair, then both reads will have a paired-read alignment score that is greater than zero and the same for both reads. In the nominal case, this will be greater than either of the single-read alignment scores of the individual reads. Very occasionally you might see a read for which the paired-read alignment score is less than the single-read score. This reflects that, even though the read has been placed with confidence, there is more than one feasible position for the read's pair given their relative orientation and insert size, so there is less confidence about the alignment of the two reads considered together as a pair.
  • A zero paired-read alignment score but a non-zero single-read alignment score indicates that no pair of alignments with a feasible relative orientation and insert distance could be found, but that the two reads could nevertheless be individually aligned with some degree of confidence. An alignment with a paired-read alignment score and single-read alignment score that are both zero is a "null alignment" of a "shadow" read, as described above.
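The score rules in the last two bullets can be summarised in a few lines of Python. This is a sketch of how a downstream user might interpret the two scores, not part of pickBestPair.pl; the function name classify_alignment and the returned labels are hypothetical.

```python
def classify_alignment(single_score: int, paired_score: int) -> str:
    """Interpret the single-read (Q s) and paired-read (Q p) scores."""
    if paired_score > 0:
        # Both halves were aligned together as a pair.
        if paired_score < single_score:
            # The read itself is well placed, but its mate has more than
            # one feasible position given orientation and insert size.
            return "paired, mate placement ambiguous"
        return "paired"
    if single_score > 0:
        # No feasible pair was found, but this read aligned on its own.
        return "aligned individually, no feasible pair"
    # Both scores zero: null alignment of a shadow read.
    return "shadow (null alignment)"
```

A shadow identified this way should also show an orientation of 'N' and a zero offset, which gives an independent check on the classification.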

paired_read_workflow.png (image/x-png)
Document generated by Confluence on Dec 19, 2007 18:32