Data analysis - documentation : New analysis modes for GERALD in Pipeline Version 0.3
This page last changed on Dec 11, 2007 by ac.
Executive Summary
There are two new modes: eland_extended and eland_pair.
The important data is contained in one file of each type per lane (or two files per lane in most cases for eland_pair mode). The output files users should be most interested in are:
Also, both modes assign a quality score to each alignment, which is obtained by combining base quality information with the existence of similar alignments. More details below.

Existing (pipeline release <=0.2.2.6) GERALD analysis modes
The version of GERALD in the v0.2.2.6 pipeline release may be run in the following four modes:
ANALYSIS default: Align to reference sequence using phageAlign.
ANALYSIS monotemplate: Align to set of tags with phageAlign.
ANALYSIS expression: Align to set of tags with phageAlign.
ANALYSIS eland: Align to (large) reference sequence using ELAND.
New (pipeline release >=0.3) GERALD analysis modes

The export.txt file
Both eland_extended and eland_pair share a common export.txt file format that contains all read, quality value and alignment information for a lane of data. However, an eland_pair analysis of a lane will produce two such files (named e.g. s_6_1_export.txt and s_6_2_export.txt), one for each of the two reads, whereas an eland_extended analysis will produce a single such file per lane (named e.g. s_6_export.txt). Moreover, not all the fields are relevant to a single-read analysis, so those fields will be left blank. The format is meant to be easily exportable to databases.
Format of export.txt file
The number of fields per line remains a constant 22; any not relevant to a particular read are left blank (the empty string ""). In particular, for a single-read analysis the Read Number, Paired Read Alignment Score and Partner Chromosome/Contig/Offset/Strand fields will all be blank.
Multiple-match ELAND
Both new modes are built around the new multiple-match ELAND output obtained by specifying --multi on the command line for ELAND.

Example of raw ELAND output
Just for interest, here are examples of the same data aligned using "standard" and multiple-match ELAND. Note carefully that both eland_pair and eland_extended treat the multiple-match ELAND output as raw data and apply considerable post-processing to it. For export into a user's own application, other files (most notably the export.txt format) should be preferred over the "raw" ELAND output, in particular because the raw ELAND output has not had signal purity filtering applied.

Standard ELAND format (separating tabs changed to spaces for clarity)
>SLXA-B3_604:6:1:533:275 GATTATTCACGATGGGGAACTGGTTTTTGATTTTTT U0 1 0 0 BAC_plus_vector.fa 153585 R ..
>SLXA-B3_604:6:1:562:335 TTTCGGTGTTTTCATTTTTTTTTCCTTTGTTAAATC U0 1 0 0 BAC_plus_vector.fa 68238 R ..
>SLXA-B3_604:6:1:613:330 AATAATCAGACCGACGATACTAGTGGGACCGTGGTC R1 0 6 1

ELAND --multi format
>SLXA-B3_604:6:1:533:275 GATTATTCACGATGGGGAACTGGTTTTTGATTTTTT 1:0:0 BAC_plus_vector.fa:153585R0
>SLXA-B3_604:6:1:562:335 TTTCGGTGTTTTCATTTTTTTTTCCTTTGTTAAATC 1:0:0 BAC_plus_vector.fa:68238R0
>SLXA-B3_604:6:1:613:330 AATAATCAGACCGACGATACTAGTGGGACCGTGGTC 0:6:1 BAC_plus_vector.fa:168839R1,168882R1,168968R1,169011R2,169054R1,169097R1,169140R1

The read ID
Each read is given a unique identifier string, which is used to identify the read in most of the intermediate files mentioned below. >SLXA-B3_604:6:1:533:275 denotes an unpaired read. It is from lane 6 tile 1 of run 604 on machine SLXA-B3, and the (X,Y) coordinates of the cluster on the tile (in pixel units) are (533,275). Taken together these data are sufficient to uniquely identify a cluster - importantly for multi-run projects, this uniqueness should extend across runs, although this is of course reliant upon machine naming and run numbering being done in a sensible way.
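The read ID layout described here can be unpacked mechanically. The following is a minimal Python sketch, not part of the pipeline: the names ReadId and parse_read_id are hypothetical, and it assumes the machine name is everything before the last underscore preceding the run number. It also accepts the optional /1 or /2 suffix used for paired reads, but does not attempt to handle an embedded index string.

```python
import re
from typing import NamedTuple, Optional


class ReadId(NamedTuple):
    """Hypothetical container for the components of a read ID."""
    machine: str
    run: int
    lane: int
    tile: int
    x: int
    y: int
    read_number: Optional[int]  # 1 or 2 for paired reads, None otherwise


# machine_run:lane:tile:x:y with an optional /1 or /2 suffix.
_READ_ID = re.compile(
    r"^>?(?P<machine>.+)_(?P<run>\d+):(?P<lane>\d+):(?P<tile>\d+)"
    r":(?P<x>\d+):(?P<y>\d+)(?:/(?P<read>[12]))?$"
)


def parse_read_id(text: str) -> ReadId:
    """Split a read ID such as '>SLXA-B3_604:6:1:533:275' into its parts."""
    match = _READ_ID.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognised read ID: {text!r}")
    read = match.group("read")
    return ReadId(
        machine=match.group("machine"),
        run=int(match.group("run")),
        lane=int(match.group("lane")),
        tile=int(match.group("tile")),
        x=int(match.group("x")),
        y=int(match.group("y")),
        read_number=int(read) if read else None,
    )


if __name__ == "__main__":
    print(parse_read_id(">SLXA-B3_604:6:1:533:275"))
```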
Furthermore, >SLXA-B3_604:6:1:533:275/1 denotes read 1 of a paired read and >SLXA-B3_604:6:1:533:275/2 denotes read 2 of a paired read. Note that for the final export.txt format the read ID is split into tab-separated fields.

Handling indexing
To support possible future applications, pipeline version 0.3 allows a subset of bases in the read to be specified as index bases (otherwise known as a bar code). These bases are removed from the read and incorporated into the read ID; they play no other part in the analysis. However, having the index string embedded in the read ID in this way gives the potential for reads and/or alignments to be easily separated by index for the purposes of e.g. SNP calling or assembly. Index bases are specified by an extension to the syntax for the USE_BASES parameter in the GERALD config file - see the user guide for details.

Single read example:

Again note that for the final export.txt format the read ID is split into tab-separated fields, of which the index string is one.

ANALYSIS eland_extended
This is an upgraded version of the 'ANALYSIS eland' mode. From pipeline release 0.2.2 onwards, 'ANALYSIS eland' supports >32 base reads in a manner that, owing to time pressure, is somewhat rudimentary. The idea of eland_extended was to improve this support and also to prototype ideas and file formats for ANALYSIS eland_pair (see below). The intention for pipeline 0.3 is to leave 'ANALYSIS eland' exactly as is for users who have built its file formats etc. into their infrastructure, but to encourage users to migrate to 'ANALYSIS eland_extended'.

'ANALYSIS eland' can align reads longer than 32 bases but demands that the first 32 bases of the read have a unique best match in the genome. The position of this match is used as a 'seed' to extend the match along the full length of the read. ANALYSIS eland_extended removes the uniqueness restriction by allowing multiple 32-mer matches to be considered and extended.
The length of the initial 'seed' alignment can also be varied from the default 32 bases.

Configuring ANALYSIS eland_extended
There are two parameters that affect the output of the alignment; both can be specified per lane.
Processing steps and file formats
ANALYSIS eland_extended is built around the 'eland --multi' output above. Processing steps are as follows.

0. s_N_qraw.txt
Contains an ASCII string for each read encoding the quality values for the called bases for that read. Encoding is ASCII code = qv + 64, so e.g. ASCII 94='^' codes for Q30.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^W^H^^^^
^^^^^^^^^^^^^^^^^^^^^^^^X^S^^^[^^^RA
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

1. s_N_eland_query.txt
All reads for a lane are concatenated into a single fasta file to use as an eland query.
>SLXA-B3_604:6:1:533:275
GATTATTCACGATGGGGAACTGGTTTTTGATTTTTT
>SLXA-B3_604:6:1:562:335
TTTCGGTGTTTTCATTTTTTTTTCCTTTGTTAAATC
>SLXA-B3_604:6:1:613:330
AATAATCAGACCGACGATACTAGTGGGACCGTGGTC

2. s_N_eland_multi.txt
The output of 'eland --multi' for lane N.
>SLXA-B3_604:6:1:533:275 GATTATTCACGATGGGGAACTGGTTTTTGATTTTTT 1:0:0 BAC_plus_vector.fa:153585R0
>SLXA-B3_604:6:1:562:335 TTTCGGTGTTTTCATTTTTTTTTCCTTTGTTAAATC 1:0:0 BAC_plus_vector.fa:68238R0
>SLXA-B3_604:6:1:613:330 AATAATCAGACCGACGATACTAGTGGGACCGTGGTC 0:6:1 BAC_plus_vector.fa:168839R1,168882R1,168968R1,169011R2,169054R1,169097R1,169140R1
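The quality encoding used in s_N_qraw.txt (ASCII code = qv + 64) is simple to decode. The sketch below is illustrative only; the function names decode_qualities and encode_qualities are hypothetical and are not part of the pipeline.

```python
QUALITY_OFFSET = 64  # from the text: ASCII code = qv + 64


def decode_qualities(encoded: str) -> list:
    """Convert one quality string into a list of integer quality values."""
    return [ord(char) - QUALITY_OFFSET for char in encoded]


def encode_qualities(values) -> str:
    """Inverse of decode_qualities: integer quality values back to a string."""
    return "".join(chr(value + QUALITY_OFFSET) for value in values)


if __name__ == "__main__":
    # Tail of the first example read: mostly Q30 with two weaker bases.
    print(decode_qualities("^W^H^^^^"))
```

Because the offset is a constant, the same functions should apply to any file that uses this encoding.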
3. s_N_frag.txt
This takes the alignment positions (based on at most 32 bases) and does an alignment of the full (possibly >32 base) read to each position. A numeral refers to a run of matching bases; an upper case base or N refers to a base in the reference that differs from the read.
36
34G1
20G15:20G15:20G15:20G9A5:20G15:20G14T:20G15
E.g. in the above, the first read is a 36-base exact match, but we need to substitute a G into the second read at position 35 (i.e. after 34 matching bases) to match the genome exactly. The third read has multiple matches, most of which have a single G substitution error at position 21.

4. s_N_eland_extended.txt
The information in s_N_eland_multi.txt and s_N_frag.txt is combined here. This corrects the reverse complement matches and incorporates a description of the alignment to the full read.
Example (NB field separators are tabs, shown as spaces here for clarity)
>SLXA-B3_604:6:1:533:275 GATTATTCACGATGGGGAACTGGTTTTTGATTTTTT 1:0:0 BAC_plus_vector.fa:153581R36
>SLXA-B3_604:6:1:562:335 TTTCGGTGTTTTCATTTTTTTTTCCTTTGTTAAATC 1:0:0 BAC_plus_vector.fa:68234R34G1
>SLXA-B3_604:6:1:613:330 AATAATCAGACCGACGATACTAGTGGGACCGTGGTC 0:6:1 BAC_plus_vector.fa:168835R20G15,168878R20G15,168964R20G15,169007R20G9A5,169050R20G15,169093R20G14T,169136R20G15
A read with hits to multiple chromosomes looks like this:
>SLXA-B3_604:6:15:816:354 ACGGTTGAGTAATAAATGGATGCACTGCCTAACCGG 0:0:2 BAC_plus_vector.fa:163013R23C4G3G3,E_coli.fa:3909838R23C4G3G3
Special cases:
>SLXA-B3_604:6:1:513:360 GAGACACCACCCCCCACCCCCCACCACACTCCCTCC NM -
>SLXA-B3_604:6:1:762:25 NNNNNNNNNNNNNNNNCNNCNNNNGANGTAANNNNA QC -
>SLXA-B3_604:6:1:853:100 ATTTGGGAGGCAAAGGCGGGCTGATTAAGAGGTGAG RM -
>SLXA-B3_604:6:1:513:922 CGGATGCGGCGTAAACGCCTTATCCGGCCCACATCA 0:35:75 -
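The match descriptors (e.g. 34G1) and the alignment list in the last field of s_N_eland_extended.txt can be unpacked with a few lines of Python. This is a minimal sketch with hypothetical function names. It assumes, based only on the examples shown, that a chromosome name applies to every following position until a new name appears, and that a lone '-' means no alignments are listed.

```python
import re

_DESCRIPTOR = re.compile(r"(?:\d+|[ACGTN])+")
_TOKEN = re.compile(r"\d+|[ACGTN]")
# One hit: optional "chromosome:" prefix, position, strand (F or R), descriptor.
_HIT = re.compile(
    r"(?:(?P<chrom>[^:,]+):)?(?P<pos>\d+)(?P<strand>[FR])(?P<desc>[0-9ACGTN]+)"
)


def read_mismatches(descriptor: str):
    """Return (aligned length, [(1-based read position, reference base), ...])."""
    if not _DESCRIPTOR.fullmatch(descriptor):
        raise ValueError(f"bad match descriptor: {descriptor!r}")
    position = 0
    mismatches = []
    for token in _TOKEN.findall(descriptor):
        if token.isdigit():
            position += int(token)  # a run of matching bases
        else:
            position += 1  # one base where the reference differs from the read
            mismatches.append((position, token))
    return position, mismatches


def parse_hits(field: str):
    """Split an alignment list into (chromosome, position, strand, descriptor)."""
    if field == "-":
        return []
    hits = []
    chrom = None
    for item in field.split(","):
        match = _HIT.fullmatch(item)
        if match is None:
            raise ValueError(f"bad alignment entry: {item!r}")
        # Assumption: an entry without a name inherits the previous one.
        chrom = match.group("chrom") or chrom
        if chrom is None:
            raise ValueError("first alignment entry has no chromosome name")
        hits.append(
            (chrom, int(match.group("pos")), match.group("strand"), match.group("desc"))
        )
    return hits


if __name__ == "__main__":
    print(read_mismatches("34G1"))
    print(parse_hits("BAC_plus_vector.fa:163013R23C4G3G3,E_coli.fa:3909838R23C4G3G3"))
```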
5. s_N_saf.txt
SAF stands for 'Short Alignment Format'. This file aims to succinctly describe the best alignment for each read (if there is one). The raw quality values are used to attempt to pick the best alignment from the potentially multiple possibilities.
>SLXA-B3_604:6:1:533:275 BAC_plus_vector.fa 153581 R 36 94
>SLXA-B3_604:6:1:562:335 BAC_plus_vector.fa 68234 R 34G1 60
>SLXA-B3_604:6:1:613:330 BAC_plus_vector.fa 168835 R 20G15 0
Special case
>SLXA-B3_604:6:19:365:135 NM
Non-aligning reads may have other codes QC, RM and x:y:z, as per the description of s_N_eland_extended.txt.
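A single-read SAF line can be read into named fields as sketched below. The field names are hypothetical labels chosen for illustration, and treating the final number as the alignment score is an assumption drawn from the examples rather than a documented definition. Paired-read SAF files carry extra fields and are not handled here.

```python
def parse_saf_line(line: str) -> dict:
    """Parse one line of a single-read SAF file into a dictionary."""
    fields = line.split()  # real files are tab separated; split() accepts both
    if len(fields) == 2:
        # Non-aligning read: the ID followed by a code such as NM, QC, RM or x:y:z.
        return {"read_id": fields[0], "aligned": False, "code": fields[1]}
    if len(fields) == 6:
        read_id, chrom, position, strand, descriptor, score = fields
        return {
            "read_id": read_id,
            "aligned": True,
            "chromosome": chrom,
            "position": int(position),
            "strand": strand,
            "descriptor": descriptor,
            "score": int(score),  # assumed to be the alignment score
        }
    raise ValueError(f"unexpected number of fields ({len(fields)}): {line!r}")


if __name__ == "__main__":
    print(parse_saf_line(">SLXA-B3_604:6:1:562:335 BAC_plus_vector.fa 68234 R 34G1 60"))
    print(parse_saf_line(">SLXA-B3_604:6:19:365:135 NM"))
```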
6. s_N_qcal.txt
This file contains quality values for each base in exactly the same format as s_N_qraw.txt, the difference being that these quality values have been recalibrated using a calibration table derived from the alignments. The algorithm is similar to the Phred algorithm from capillary sequencing.

7. s_N_calsaf.txt
CALSAF stands for 'CALibrated Short Alignment Format'. The format is identical to s_N_saf.txt, the difference being that the calibrated quality values are used to pick the best alignment.

8. s_N_export.txt
This file is generated from s_N_calsaf.txt by
9. s_N_sorted.txt
All the files mentioned up until now contain one line per cluster, sorted in the standard order of by tile then by position in tile. No filtering has yet been done. The s_N_sorted.txt file is obtained from s_N_export.txt by
>SLXA-B3_604:6:10:221:356 CTTGGAGAATGTTCCATGTGCTAATGAATAGAATGT ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ZXSRNNNN BAC_plus_vector.fa 11655 R 36 65
>SLXA-B3_604:6:5:107:178 ATTGGAGAATGTTCCATGAGCTAATGAATAGAATGT ^^L^^C^^^^^^^^^^^^DI^^^^K^^^ZXSRNNIN BAC_plus_vector.fa 11655 R C17T17 0
>SLXA-B3_604:6:10:810:361 CATTCTATTCATTAGCACATGGAACATTCTCCAAGA ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ZXSRNNNN BAC_plus_vector.fa 11656 F 36 65
ANALYSIS eland_pair
'ANALYSIS eland_pair' allows the analysis of a paired read run using ELAND alignments. This mode is based heavily on 'ANALYSIS eland_extended' - to start with it is essentially doing parallel 'eland_extended' analyses for the two reads. s_N_1_qraw.txt, s_N_1_eland_query.txt, s_N_1_eland_multi.txt, s_N_1_frag.txt and s_N_1_eland_extended.txt files are produced for read 1, which have identical format and function to their single-read homologues. Similarly for read 2.

A script pickBestPair.pl now compares s_N_1_eland_extended.txt and s_N_2_eland_extended.txt along with their quality values and produces two files of alignments, s_N_1_saf.txt and s_N_2_saf.txt, which contain pairing information. These are then used to do quality value recalibration, generating calibrated quality values s_N_1_qcal.txt and s_N_2_qcal.txt. pickBestPair.pl is then re-run using the calibrated quality values to obtain two further files of alignments, s_N_1_calsaf.txt and s_N_2_calsaf.txt. Finally these files are parsed into files s_N_1_export.txt, s_N_2_export.txt, s_N_1_sorted.txt and s_N_2_sorted.txt.
Configuring a paired read analysis
The alignments of the two reads that provide input to the pairing process may be varied by setting ELAND_SEED_LENGTH and ELAND_MAX_MATCHES. Both parameters may be set lane-by-lane, but the same values will apply to each of the two reads in a lane.

The paired read analysis may be configured by passing options to pickBestPair. This is done by setting a parameter PAIR_PARAMS in the GERALD config file.
The number of unique pairs, expressed as a percentage of the total number of clusters passing filters, must exceed a certain percentage, else no pairing is attempted and the two reads are effectively treated as two sets of single reads.
The default value of 1000000 effectively means this feature is switched off by default.

PAIR_PARAMS can of course be set on a lane-by-lane basis. Note however that all the options must be specified space-separated on a single line, e.g. you must do
8:PAIR_PARAMS --circular --min-percent-unique-pairs=30
and not anything like
8:PAIR_PARAMS --circular
8:PAIR_PARAMS --min-percent-unique-pairs=30

Alignment statistics produced by pickBestPair
pickBestPair dumps paired read alignment stats into a file s_N_pair.xml that is produced for each lane of eland_pair analysis. Most of the useful information therein goes into tables in the Summary.htm web pages.

<ReadPairProperties>

A dump of the parameters used to do the pairing.
<ControlParametersUsed>
  <add-shadow-to-singleton-threshold>1000000</add-shadow-to-singleton-threshold>
  <circular>N</circular>
  <min-paired-read-alignment-score>0</min-paired-read-alignment-score>
  <min-percent-consistent-pairs>70</min-percent-consistent-pairs>
  <min-percent-unique-pairs>8</min-percent-unique-pairs>
  <min-single-read-alignment-score>0</min-single-read-alignment-score>
</ControlParametersUsed>

Total length of reads, and the subsets of the reads used for the ELAND alignments.
<Length>
  <Read2>
    <Total>38</Total>
    <SeedLengthForELAND>32</SeedLengthForELAND>
  </Read2>
  <Read1>
    <Total>35</Total>
    <SeedLengthForELAND>32</SeedLengthForELAND>
  </Read1>
</Length>

Only reads with insert sizes of between Max and Min inclusive are considered nominal.
<InsertSize>
  <HighSD>11</HighSD>
  <LowSD>14</LowSD>
  <Max>234</Max>
  <Median>201</Median>
  <Min>159</Min>
</InsertSize>

Reads with unique alignments of each pair are classified according to the relative orientation of the two reads. NominalOrientationPercent must exceed --min-percent-consistent-pairs for pairing to go ahead.
<Orientation>
  <Fm>2345</Fm>
  <Fp>2199</Fp>
  <Nominal>Rp</Nominal>
  <NominalOrientationButLargeInsert>15383</NominalOrientationButLargeInsert>
  <NominalOrientationButLargeInsertPercent>0.43385</NominalOrientationButLargeInsertPercent>
  <NominalOrientationButSmallInsert>52235</NominalOrientationButSmallInsert>
  <NominalOrientationButSmallInsertPercent>1.47318</NominalOrientationButSmallInsertPercent>
  <NominalOrientationPercent>99.74414</NominalOrientationPercent>
  <Rm>4528</Rm>
  <Rp>3536661</Rp>
</Orientation>

This gives stats on the unique pairs used to calculate the insert distance. InitialUniquePairsPercent must exceed --min-percent-unique-pairs for pairing to go ahead.
<Pairs>
  <ClustersPassedFiltering>5424640</ClustersPassedFiltering>
  <ClustersTotal>11885647</ClustersTotal>
  <ClustersUsedToComputeInsert>3545733</ClustersUsedToComputeInsert>
  <InitialUniquePairsPercent>29.83206</InitialUniquePairsPercent>
</Pairs>

This table breaks down the fate of each cluster according to the fate of its two constituent reads.
<Reads>
  <Read1SingleAlignmentFound>
    <Read2ManyAlignmentsFound>
      <BothAlignButNoFeasiblePair>
        <BothAlignmentsOK>6812</BothAlignmentsOK>
      </BothAlignButNoFeasiblePair>
      <ManyPairedAlignments>3576</ManyPairedAlignments>
      <UniquePairedAlignment>163549</UniquePairedAlignment>
    </Read2ManyAlignmentsFound>
    <Read2NM>
      <NoMatchToEither>1</NoMatchToEither>
      <SingletonRead1>
        <AlignmentOK>748788</AlignmentOK>
      </SingletonRead1>
    </Read2NM>
    <Read2QC>
      <SingletonRead1>
        <AlignmentOK>27343</AlignmentOK>
      </SingletonRead1>
    </Read2QC>
    <Read2Repeat>
      <SingletonRead1>
        <AlignmentOK>2254</AlignmentOK>
      </SingletonRead1>
    </Read2Repeat>
    <Read2SingleAlignmentFound>
      <BothAlignButNoFeasiblePair>
        <BothAlignmentsOK>76669</BothAlignmentsOK>
      </BothAlignButNoFeasiblePair>
      <SingletonRead1>
        <AlignmentOK>16</AlignmentOK>
      </SingletonRead1>
      <SingletonRead2>
        <AlignmentOK>5</AlignmentOK>
      </SingletonRead2>
      <UniquePairedAlignment>3469043</UniquePairedAlignment>
    </Read2SingleAlignmentFound>
  </Read1SingleAlignmentFound>
  <Read1ManyAlignmentsFound> ... </Read1ManyAlignmentsFound>
  <Read1NM> ... </Read1NM>
  <Read1QC> ... </Read1QC>
  <Read1Repeat> ... </Read1Repeat>
</Reads>
</ReadPairProperties>

File format descriptions
The script pickBestPair.pl compares s_N_1_eland_extended.txt and s_N_2_eland_extended.txt along with their quality values and produces two files of alignments, s_N_1_saf.txt and s_N_2_saf.txt, which are similar in format to their single-read s_N_saf.txt homologue but have three extra fields. Example:
>SLXA-B3_604:4:1:363:727 BAC_plus_vector.fa 40840 F 36 94 94 R 68
>SLXA-B3_604:4:1:434:81 BAC_plus_vector.fa 46936 F 36 94 94 R 55
>SLXA-B3_604:4:1:582:447 BAC_plus_vector.fa 16075 R 36 94 94 F 57
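Returning briefly to s_N_pair.xml: the two preconditions stated for pairing (InitialUniquePairsPercent must exceed min-percent-unique-pairs, and NominalOrientationPercent must exceed min-percent-consistent-pairs) can be checked from the file itself. The following Python sketch is illustrative only. The function name pairing_checks is hypothetical, and SAMPLE_XML is a cut-down document assembled from the values quoted in the statistics example, on the assumption that the real file contains the elements without the interleaved commentary.

```python
import xml.etree.ElementTree as ET

# Cut-down sample built from the values quoted in the documentation.
SAMPLE_XML = """<ReadPairProperties>
  <ControlParametersUsed>
    <min-percent-consistent-pairs>70</min-percent-consistent-pairs>
    <min-percent-unique-pairs>8</min-percent-unique-pairs>
  </ControlParametersUsed>
  <Orientation>
    <NominalOrientationPercent>99.74414</NominalOrientationPercent>
  </Orientation>
  <Pairs>
    <InitialUniquePairsPercent>29.83206</InitialUniquePairsPercent>
  </Pairs>
</ReadPairProperties>"""


def _number(root, path: str) -> float:
    """Fetch the text of the element at `path` and convert it to a float."""
    text = root.findtext(path)
    if text is None:
        raise KeyError(f"element not found: {path}")
    return float(text)


def pairing_checks(xml_text: str) -> dict:
    """Report whether each stated precondition for pairing is satisfied."""
    root = ET.fromstring(xml_text)
    unique = _number(root, "Pairs/InitialUniquePairsPercent")
    min_unique = _number(root, "ControlParametersUsed/min-percent-unique-pairs")
    nominal = _number(root, "Orientation/NominalOrientationPercent")
    min_consistent = _number(
        root, "ControlParametersUsed/min-percent-consistent-pairs"
    )
    # "Must exceed" is read here as a strict comparison.
    return {
        "unique_pairs_ok": unique > min_unique,
        "orientation_ok": nominal > min_consistent,
    }


if __name__ == "__main__":
    print(pairing_checks(SAMPLE_XML))
```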
s_N_1_qcal.txt, s_N_2_qcal.txt
These are quality values obtained by calibrating the raw quality values using the alignments, as for s_N_qcal.txt in ANALYSIS eland_extended.

s_N_1_calsaf.txt, s_N_2_calsaf.txt
These are identical in format to s_N_1_saf.txt and s_N_2_saf.txt, the difference being that the calibrated quality values are used to pick the best alignment.

s_N_1_export.txt, s_N_2_export.txt
These files are generated from s_N_1_calsaf.txt and s_N_2_calsaf.txt by
s_N_1_sorted.txt, s_N_2_sorted.txt
These are obtained from s_N_1_export.txt and s_N_2_export.txt in a manner identical to the way s_N_sorted.txt is obtained from s_N_export.txt in a single-read analysis, namely:
Anomalous pairs are sent to a separate file s_N_anomaly.txt
Reads with no decent pairing are sent here; both reads of each pair go in one file. This file is obtained from a precursor s_N_anomraw.txt file that is padded with 'OK' lines to match the line spacing and then purity filtered. The format of each line is
>EAS51_35:1:15:336:755 OK
>EAS51_35:1:15:386:460 TACCCTTCCCTGCAAACGATCTACG CACTGTCACCATACCCAGCTCGTGC hhhhhhhhhhhhhhSbhYNe_fIXJ hhhhhhhhhhhhShhhMWhNSCTG` E_coli.fa:4463732R25 E_coli.fa:2175996F25
>EAS51_35:1:15:795:876 AAACGCCTCAGCTTTGCTCCTGTTC CAGCCAGGTTTCCCGTTTCTGCCGG hQchhhQhhhhShhhhhhOBh``hh KMhhhhhchhFWh^hhhhhYJG^?W E_coli.fa:1303182R19A5 NM
>EAS51_35:1:15:753:733 ATACTAACAAAAAAAAGCAAAAAAA AAGGATGACACGGAACAGTGTAAGC hDhQhhhIV]N[NJMF[A\hCFhaO OhhJh[_hRhT\hdhCPHCUfEJN NM E_coli.fa:1990794F25
>EAS51_35:1:15:90:75 TNNNNNAATCNNNNNGATCACCGAC TGCTGGACTCGGGTCAGCCCAATTT J;;;;;Lf\A;;;;;AhNBhIA\DO hhhhhOhbXZhdh[NbBND@KJL QC NM

Line-by-line explanation:
Algorithm
The tricky thing with paired reads is to account for, and behave appropriately with respect to, all possible combinations of match scenarios for the two reads. The diagram below attempts to summarize the decision process for aligning a paired read.
[Diagram: decision process for aligning a paired read (image not available)]
Document generated by Confluence on Dec 19, 2007 18:32