Data analysis - documentation : What do the different files in an analysis directory mean
This page last changed on Oct 25, 2007 by rshaw.
Purpose and scope of this page

This is a guide intended to help analysis users interpret the various graphs that appear in an analysis directory.

Introduction

The data analysis produces output as a set of text files - with the volume of data now produced, there tend to be rather a lot of them. We have tried to summarize the results of an analysis by creating various graphs. These in turn are summarized as web pages that enable a large number of graphs to be viewed as thumbnails, giving a chip-wide view of variations in data quality. These web pages are given standard names.

Web pages produced in an experiment directory

General note: most of these pages attempt to show a thumbnail graph for each tile on a chip. As each thumbnail is loaded in as a full size image, this is starting to prove somewhat unwieldy for 200-tiles-per-lane experiments. As the number of tiles and the number of requested graphs has increased, it has become impractical to generate every possible graph for every tile. The pages here must therefore be considered as just a very basic view of the data that is unlikely to be added to significantly - the Flake tool provides a more interactive and versatile view of the data.

Error.htm

Along with the Summary.htm page, this is the page that most users are familiar with. This shows a graph of error rates for each tile on a chip. The red bar shows the percentage of bases at each cycle that are wrong, as estimated from the alignment. "Glitches" such as focus problems manifest themselves as "spikes" in the graph.

What good data looks like

1-2% or less up to cycle 20 or so.

Note on sequence alignments: All sequence alignment at Solexa is done using one of two programs: ELAND and phageAlign. phageAlign allows any number of errors in an alignment and thus provides an accurate picture of the error rate. However, as the name implies, it was originally developed when we were resequencing the 5Kb Phi-X 174 genome, and it is too slow for aligning against targets bigger than a couple of Mb. ELAND is capable of aligning against the human genome in reasonable time; however, it allows at most two errors per fragment.
This means that error rates will be underestimated by ELAND alignments, as very poor quality data is simply not aligned and so does not contribute to the error rate graphs or calculations. One must therefore be careful not to draw false conclusions when comparing the quality of data aligned by the two different methods.

Hist.htm

(No longer generated by default as of Pipeline v. 0.3) Each curve on this graph is a histogram showing the "spread" in read quality. The red curve is the quality distribution of all reads, green for those that pass filtering. Blue is the quality distribution estimated from the quality values produced. Read quality is measured using the average quality of the bases in a read. A quality of 20 corresponds to a 1% error rate, 30 to a 0.1% error rate.

What good data looks like

A high peak (lots of sequences), a peak near the right hand side of the graph (good quality) and a tight spread (consistent quality). A green peak much smaller than the red peak indicates either over-zealous filtering or poor quality raw data.

Info.htm

(No longer generated by default as of Pipeline v. 0.3) This attempts to quantify the notion of throughput in a tile. Any set of bases containing errors carries the same amount of information as a smaller number of perfect (error-free) bases. The number of bases per cycle actually sequenced is shown in black. Their information content is shown in red, and after filtering shown in green.

What good data looks like

A horizontal red line (no drop in information content, a.k.a. data quality) that is near to the black line (good initial data quality) and near to the top of the page (high information content). Launch spec is something like 14000 clusters per tile at 1-2% error rate. This is off the top of the graph at the current scale. A green curve too far from the red curve indicates that a lot of information has been lost by filtering.

All.htm

Gives a tile by tile representation of the mean matrix adjusted intensity of clusters plotted as a function of cycle.
It plots each channel (ACGT) separately as a different coloured line. Means are calculated over all clusters, irrespective of base calling. Thus if all (or most) clusters are "T", channels "A", "C" and "G" will be at (or near) zero. If all bases are present in the sample at 25%, a well-balanced matrix should show all channels having similar intensities. If intensities are not all equal, it could be indicative of either poor cross-talk correction or poor absolute intensity balance between each channel.

Perfect.htm

This graph shows the proportion of reads in a tile that have 0, 1, 2, 3 or 4 errors by the time they get to a given cycle.

What good data looks like

As many zero-error reads as possible at as late a cycle as possible - the more grey the better, basically!

IVC.htm

There are several plots here, all produced as lane averages over all tiles in the lane.

All: This is the lane average of the data displayed in All.htm. It plots each channel (ACGT) separately as a different coloured line. Means are calculated over all clusters, irrespective of base calling. Thus if all (or most) clusters are "T", channels "A", "C" and "G" will be at (or near) zero. If all bases are present in the sample at 25%, a well-balanced matrix should show all channels having similar intensities. If intensities are not all equal, it could be indicative of either poor cross-talk correction or poor absolute intensity balance between each channel.

Called: Similar to All, but means are calculated for each channel only using those clusters that the base-caller has called in that channel. Thus for a sample with equal base representation (25% of each) and with pure signal (zero intensity in the non-called channels), the Called intensity will be 4 times that of All, as the intensities will only be averaged over 25% of the clusters. For impure clusters, the Called intensity will be less than 4 times that of All.
The Called intensities should be independent of the base representation of the sample, so a well-balanced matrix should show all channels as having equal intensity. This can lead to some confusion in the interpretation of plots of monotemplate data. A single cluster or speck of 'dust' that is called in a different channel could be plotted as a very large intensity, even though the All and %Base_Calls plots look to be showing zero. It is thus often better to look at the All plots to get intensities for monotemplate data.

Deblock: If deblock images have been analysed then Deblock intensities plots will be displayed. These show the average deblock intensity as a function of cycle. Due to the way image analysis picks out the brightest pixel in the vicinity of each cluster, there will always be a positive signal here, even if the deblocking was perfect. Other than very high intensities, the thing to look out for is intensities following the sequence of a monotemplate. This could be indicative of incomplete cleavage of the dye (a problem), though it could be cleaved dye that has not been fully washed away and may go during the next cycle of enzymology fluidics.

%Base_Calls: The percentage of each base called as a function of cycle. Ideally this should be constant for a genomic sample, reflecting the base representation of the sample. In practice, at later cycles it is often the case that some bases start to be favoured over others. As the signal decays, some bases may start to fall into the noise while others still rise above it. Matrix adjustments may help if data needs to be optimised. For monotemplate data this should show 100% for the correct base at each cycle, thus deviations from this are a measure of errors.

%All and %Called: Exactly the same as All and Called but expressed as a percentage of the summed intensities. This makes it easier to see changes in relative intensities between channels as a function of cycle by removing any intensity decay.
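The relationship between the All, Called and percentage plots can be illustrated with a small sketch. This is not the pipeline's own code: the function names `ivc_means` and `as_percent`, and the in-memory layout (one tuple of four matrix-adjusted intensities per cluster, plus one called base per cluster), are assumptions made purely for illustration of a single cycle.

```python
BASES = "ACGT"

def ivc_means(intensities, calls):
    """Per-channel All and Called means for one cycle (hypothetical helper).

    intensities: list of (A, C, G, T) intensity tuples, one per cluster
    calls: list of called bases, one per cluster
    """
    n = len(intensities)
    all_means, called_means = {}, {}
    for ch, base in enumerate(BASES):
        # All: average over every cluster, irrespective of base calling
        all_means[base] = sum(row[ch] for row in intensities) / n
        # Called: average only over clusters called in this channel
        called = [row[ch] for row, call in zip(intensities, calls) if call == base]
        called_means[base] = sum(called) / len(called) if called else 0.0
    return all_means, called_means

def as_percent(means):
    """Express per-channel means as a percentage of their sum (%All / %Called)."""
    total = sum(means.values())
    return {b: (100.0 * v / total if total else 0.0) for b, v in means.items()}

# Toy cycle: 8 pure clusters, 2 per base, intensity 100 in the called channel only
calls = list("AACCGGTT")
intensities = [tuple(100.0 if b == call else 0.0 for b in BASES) for call in calls]
all_means, called_means = ivc_means(intensities, calls)
percent_all = as_percent(all_means)
```

With equal base representation and pure signal, the Called mean in each channel comes out at four times the All mean, which is the factor of 4 described for the Called plot.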

Monotemplate.htm

This page will only be seen if the run folder contains at least one lane of what is confusingly termed 'monotemplate' data ('mixed template' is probably a better name). Monotemplate lanes contain a mixture of a small number of DNA templates, so we expect each read to be one of the template sequences. A mixture of 4 is often used, in which case we expect each read to be one of four different sequences. The page shows a separate set of plots for each of the template sequences in each lane. The 'All' and 'Called' plots are analogous to the eponymous plots in the IVC.htm page, except they are specific to a template sequence. The '% error' plot shows the percentage of errors at each cycle and the '% perfect' plot shows the distribution of 0, 1, 2, 3 and 4 error reads by cycle (as for the Perfect.htm page), both specific to a particular template.

Tile.htm

Draws thumbnails of all graphs in a directory, placing all graphs pertaining to a particular tile on the same row. This enables different aspects of a tile's quality (or otherwise) to be seen in one "view." NB due to the potentially large number of graphs you may have trouble opening this page. It probably ought to be split into separate files by channel, or maybe only show graphs for a subset of tiles.

Summary.htm

This is entirely text based (so will load in reasonable time even for a large experiment) and contains links to all the pages mentioned so far.

Tables in a Summary.htm page:

Chip Summary: Information about the machine and flowcell.

Chip Results Summary: Summary chip-wide performance statistics for the run (raw and filtered cluster counts, yield).

Lane Parameters Summary: Information about the sample in each lane and the analysis performed on it.

Lane Results Summary: Basic data quality metrics for each lane. If eland_pair analysis has been specified for one or more lanes, then there will be two such summaries - one for each read.
All lanes for which analysis has been specified will be represented in 'Expanded Lane Summary : Read 1' but only those for which eland_pair analysis has been specified will contribute statistics to the Read 2 table.

Expanded Lane Summary: More detailed quality metrics for each lane. In the same manner as for the Lane Results Summary, specification of eland_pair analysis for any of the lanes will result in two Expanded Lane Summary tables - displaying statistics for Read 1 and (for lanes where it exists) Read 2.

There then follows a table for each lane, each containing one line of quality information for each tile in that lane. For monotemplate lanes, this per-tile summary table is followed by an additional 'Monotemplate Summary' table for that lane.

Pair Summary: For lanes for which eland_pair analysis was performed, there will be two per-tile summary tables (one for each read) and these will be preceded by a set of tables collectively entitled the 'Pair Summary'. These provide statistics about the alignment outcomes of the two reads individually and as a pair, the latter including relative orientation and separation (insert size) of partner read alignments.

Bonus feature for geeks: although the Summary.htm file is an HTML file, it is also a valid XML file (or is at least parseable by Perl's XML::Simple module). This provides a means for mining the numbers in a Summary.htm file (or multiple Summary.htm files) via a Perl script.

Files produced for each lane on a chip

s_3_sequence.txt

This file contains all sequences in a lane of a chip in a format designed to be exportable. The content of this file is affected by the following parameters.

s_3_qreport.txt

This is a text file that reports the accuracy of the instrument quality values, making use of the _qraw.txt files.

s_3_qcalreport.txt

This is a text file that reports the accuracy of the recalibrated quality values, making use of the _qcal.txt files (if they are present for the type of analysis you have specified). Its format is identical to s_3_qreport.txt.

Files produced for each tile of a chip

All files pertaining to e.g. tile 23 of lane 3 have names starting with s_3_0023... (Note for Windows users: You will want to open text files in Wordpad not Notepad as Notepad doesn't display them properly - the Windows carriage returns are missing at the end of lines.)

As of Version 1.8 of GERALD.pl, several new files are produced. These properly separate the quality filtering and realignment stages and also facilitate the filtering out of contaminant sequences. Contaminant filtering is switched on by specifying a file of contaminant sequences as CONTAM_FILE in the config file. Certain of the files below are only produced if contaminant filtering is switched on.

s_3_0023_align.txt

First pass alignments for a tile.

s_3_0023_calign.txt

First pass alignments of the sequences in the tile against the file of contaminant sequences specified in CONTAM_FILE in the Make. If CONTAM_FILE is not specified this file is not produced.

s_3_0023_cdiff.txt

Only produced if CONTAM_FILE is specified. Difference in alignment scores of alignment to data versus alignment to contaminant file. If negative, the corresponding sequence aligns better to contaminant than to data.

s_3_0023_score.txt

Error rate information from first pass alignments. Error rate information is contained in text form in here (handy for pasting into Excel), along with a list of the most common sequences, which can be useful for spotting potential contaminants. If CONTAM_FILE is specified, sequences having a negative entry in the s_3_0023_cdiff.txt file (i.e. likely contaminants) are ignored.
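The rule that sequences with a negative s_3_0023_cdiff.txt entry are ignored can be sketched as follows. This is a minimal illustration, not the pipeline's implementation: it assumes that the difference file has one line per alignment line, in the same order, and that the score difference is the first whitespace-separated field. The function name `drop_likely_contaminants` and the toy records are hypothetical and do not reproduce the real file contents.

```python
def drop_likely_contaminants(align_lines, cdiff_lines):
    """Keep alignment lines whose score difference is not negative.

    Assumes (hypothetically) a line-to-line correspondence between the two
    inputs, with the score difference as the first field of each cdiff line.
    """
    kept = []
    for align, cdiff in zip(align_lines, cdiff_lines):
        score_diff = float(cdiff.split()[0])
        # Negative difference: aligns better to the contaminant file than to the data
        if score_diff >= 0:
            kept.append(align)
    return kept

# Placeholder records standing in for lines of an alignment file
align_lines = ["alignment_1", "alignment_2", "alignment_3"]
cdiff_lines = ["12", "-7", "0"]
kept = drop_likely_contaminants(align_lines, cdiff_lines)
```

Here the second record is dropped because its difference is negative. Whether a difference of exactly zero should be kept is a judgment call in this sketch.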

s_3_0023_qhg.txt

These files get produced at the Bustard stage in the latest pipeline versions.

s_3_0023_prealign.txt

Realignment of all sequences against the data, using the error rate information in s_3_0023_score.txt to refine the alignment by re-weighting each base at each cycle according to its confidence. ELAND does not (yet) have the feature to weight the contribution of bases in an alignment, so if the lane in question was analysed using ELAND this file is just a copy of s_3_0023_align.txt.

s_3_0023_realign.txt

Consists of alignments in s_3_0023_prealign.txt, filtered to exclude alignments for those clusters that do not pass the quality criterion QF_PARAMS when applied to s_3_0023_qhg.txt. Prior to GERALD v1.8 the realignment and filtering was done in a single step (i.e. no intermediate prealign file was produced). Note that even if contaminant filtering is switched on the alignments here will NOT have been contaminant filtered. This can be done on the fly using s_3_0023_crediff.txt and qualityFilter.pl. To retain non-contaminants only, use

    cat s_3_0023_realign.txt | qualityFilter.pl '($F[0]>0)' s_3_0023_crediff.txt | ...

Replace '>' with '<=' to retain contaminants only.

s_3_0023_crealign.txt

Only produced if CONTAM_FILE is specified. Contains realignments of the sequences in the tile against the file of contaminant sequences specified in CONTAM_FILE, using the error rate information in s_3_0023_score.txt to refine the alignment by re-weighting each base at each cycle according to its confidence.

s_3_0023_cprediff.txt

Only produced if CONTAM_FILE is specified. Contains differences in alignment scores of realignment to data (in s_3_0023_prealign.txt) versus realignment to contaminant file (in s_3_0023_crealign.txt). If negative, the corresponding sequence aligns better to contaminant than to data. Analogous to s_3_0023_cdiff.txt.

s_3_0023_crediff.txt

Only produced if CONTAM_FILE is specified.
Contains differences in alignment scores of realignment to data (in s_3_0023_prealign.txt) versus realignment to contaminant file (in s_3_0023_crealign.txt), filtered to exclude alignments for those clusters that do not pass the quality criterion QF_PARAMS when applied to s_3_0023_qhg.txt. This is based on s_3_0023_cprediff.txt, filtered so as to have a line-to-line correspondence with the realignments in s_3_0023_realign.txt.

s_3_0023_rescore.txt

Improved estimate of the error rate based on s_3_0023_realign.txt. If CONTAM_FILE is specified, the calculation ignores sequences having a negative entry in the s_3_0023_crediff.txt file (i.e. likely contaminants).

s_3_0023_qalign.txt

Alignment done using base quality values to weight the bases. Not produced if the alignments for the lane in question were generated from an ELAND analysis, as ELAND does not (yet) have the feature to weight bases by their quality values.

s_3_0023_rescore.png

Viewable error rate graph drawn from s_3_0023_rescore.txt. This is used as a thumbnail in Error.htm.

Files relating to quality values

The Bustard base caller produces files named s_3_0023_prb.txt and similar that sit in the directory above the analysis directory. For an explanation of the quality scoring scheme go to Alignment Scoring Guide and FAQ. In the analysis folder you will see the following:

s_3_0023_qraw.txt

The file s_3_0023_prb.txt contains four scores for each base. The highest of these scores is the score pertaining to the called base. s_3_0023_qraw.txt collates these.
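Picking out the called-base score from four per-base scores can be sketched as below. The layout used here (one read per line, with tab-separated groups of four whitespace-separated integer scores, one group per cycle) is an assumption made for illustration, as is the function name `called_base_scores`. It is not a statement of how the pipeline itself builds s_3_0023_qraw.txt.

```python
def called_base_scores(prb_line):
    """Return the highest of the four scores for each base in one read.

    Assumes (hypothetically) tab-separated groups of four integer scores,
    one group per cycle.
    """
    scores = []
    for group in prb_line.strip().split("\t"):
        values = [int(v) for v in group.split()]
        if len(values) != 4:
            raise ValueError("expected four scores per base, got %d" % len(values))
        # The highest score is the one pertaining to the called base
        scores.append(max(values))
    return scores

# Toy three-cycle read
line = "40 -40 -40 -40\t-5 12 -20 -30\t-40 -40 -40 40"
qualities = called_base_scores(line)
```

For this toy read the per-base scores are 40, 12 and 40, i.e. one value per cycle in place of the original four.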

s_3_0023_qcal.txt

In ANALYSIS default and ANALYSIS eland mode, s_3_0023_align.txt and the intensity information are used to perform a recalibration of the quality values - details of the recalibration procedure may be found in Base-call calibration. This file contains the recalibrated quality values in a format identical to s_3_0023_qraw.txt.

s_3_0023_qval.txt, s_3_0023_qtable.txt

These are intermediate files produced during the generation of s_3_0023_qcal.txt. Normally they are deleted when the analysis is completed, but they may be present in an analysis folder if the analysis was interrupted for any reason.

config.txt, config.xml, Makefile

These generally will not form part of public releases of Solexa data as they are essentially admin files. However, they will be seen in the analysis folder when a run has finished. The config.txt file specifies what analysis should be done for each lane. This is translated by a program called GERALD into a Makefile, which is a recipe that specifies exactly what commands should be executed to carry out the requested analysis. The config.txt file used to generate an analysis is copied to the analysis folder so that, if a reanalysis of the same data is required, it can be passed to GERALD again, after modification if necessary. The config.txt file is being replaced by an XML file, config.xml, that performs the same function. For the moment both files are present in a run folder.
Document generated by Confluence on Jul 25, 2008 16:42