Data analysis - documentation : Pipeline output and visualisation
This page last changed on Aug 14, 2007 by maising.
The analysis pipeline uses an open, text-file based architecture. As a result, you have access to many different pieces of intermediate information. This page describes the pieces of information that you will most likely want to look at.

Sequences and quality scores

At the most basic level, the pipeline provides a set of sequences and base-specific quality scores. The sequences are bundled by lane and come in one of several text formats; the choice of format is configurable by the user. The currently supported formats are:
The output files are called s_1_sequence.txt, s_2_sequence.txt, ... and can be found in the GERALD folder of a finished analysis run. More information can be found in What do the different files in an analysis directory mean.

Alignments

If a reference file has been specified, the analysis can produce a set of "alignment files". These report the position of the best match to the reference (allowing for SNPs), an alignment score, a flag indicating whether the match is to the forward or reverse strand, and, if available, the alignment score of the next-best match. The alignment files have names of the form "s_1_0013_realign.txt". There is one such file per tile on the flowcell. More information can be found in What do the different files in an analysis directory mean.

SNPs and allele calling

There is no allele caller included with the pipeline at the moment, but we expect to include one soon. For now, you have to parse the alignment files yourself and use the quality scores and the positions in the reference to generate your own SNP calls.

Assembly

We do not currently provide a de novo assembler. However, de novo assembly of short reads is under active development at a significant number of external groups. It appears that assembly of genomes of several Mb using Solexa-type read pairs has been successfully simulated, and assembly of large contigs from real data has been demonstrated even without read pairs. We expect assembly programs suitable for Solexa data to become available in the near future.

Visualisation

Pipeline output

The Gerald part of the pipeline produces a series of diagnostic QC plots and summary tables. These are presented as HTML pages and can be found in the GERALD output folder. Again, more information can be found in What do the different files in an analysis directory mean.

Gap4 and tgap

There is currently no genome browser included with the pipeline.
Gap4 can now read Solexa data and display traces, but it is not suitable for large data volumes. Displaying a whole Genome Analyzer run is likely to be infeasible. James Bonfield, the current gap4 maintainer, is developing the infrastructure for a new version (possibly called "gap5") that will be able to deal with the large data volumes produced by short-read sequencing. James also provides a prototype utility called "tgap" (ftp://ftp.sanger.ac.uk/pub/PRODUCTION_SOFTWARE/src/tgap-0.3.tar.gz), which is a fast viewer for short-read data "provide[d] as a useful stop-gap measure while [James works] on improving gap4 for real". The Sanger Institute also provide a sample Genome Analyzer data set that can be displayed in the viewer, and have been so kind as to put up their conversion scripts and programs for Illumina data on the Sanger website.

UCSC browser

Data that is uniquely aligned to a genome can be viewed as a custom track in the UCSC genome browser (viewable only from the machine from which it was uploaded). More on UCSC custom tracks at http://genome.ucsc.edu/goldenPath/help/customTrack.html. To generate a track in the BED format from Gerald *_realign.txt files (let's say for lane 3, assuming 25 nt sequences):

    cat s_3_????_realign.txt | egrep -v '^#' | \
      perl -ane 'if (@F>3 && /(chr.+):(\d+)\s([FR])/) {print $1,"\t",$2,"\t",($2+25),"\n"}' \
      > s3_customTrack.txt

The command skips comment lines and lines with three or fewer fields, and prints one line per remaining match: the chromosome name, the match position, and the match position plus the read length. The strand flag is matched but not written to the track.

Ensembl

As part of a UK LINK grant, plug-ins and import scripts to load and visualise Solexa data in Ensembl have been developed. This allows the incorporation of Solexa data in Ensembl views and adds various extensions, like the ability to view coverage tracks, stacked-up reads and SNPs as well as intensity traces. However, this currently requires a full Ensembl installation at your site. If you are interested in this application, please get in touch with us (see also Pipeline FAQ).
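
For readers who prefer a script to the one-liner in the UCSC browser section, the same conversion can be sketched in Python. This is only a rough sketch and not part of the pipeline. The function names `realign_to_bed` and `write_lane_track` and the `start_offset` parameter are made up for illustration. The sketch assumes, as the one-liner does, that each usable line has more than three whitespace-separated fields and contains a match of the form `chrNAME:POSITION` followed by `F` or `R`. It additionally writes the strand as a sixth BED column. Whether the reported position is 0-based or 1-based is not stated on this page, so the position is used unchanged by default.

```python
import glob
import re

# Same shape the one-liner looks for: "chrNAME:POSITION<whitespace>F|R".
MATCH = re.compile(r"(chr[^:\s]+):(\d+)\s+([FR])")


def realign_to_bed(lines, read_length=25, start_offset=0):
    """Yield 6-column BED lines (chrom, start, end, name, score, strand).

    read_length  : assumed read length (25 nt in the example on this page).
    start_offset : hypothetical knob; set to -1 if the positions turn out
                   to be 1-based, since BED start coordinates are 0-based.
    """
    for line in lines:
        # Skip comment lines and lines without enough fields to hold a match.
        if line.startswith("#") or len(line.split()) <= 3:
            continue
        hit = MATCH.search(line)
        if hit is None:
            continue
        start = int(hit.group(2)) + start_offset
        strand = "+" if hit.group(3) == "F" else "-"
        # "." and "0" are placeholders for the BED name and score columns.
        yield "\t".join(
            [hit.group(1), str(start), str(start + read_length), ".", "0", strand]
        )


def write_lane_track(pattern, out_path, **kwargs):
    """Convert every file matching `pattern` into a single BED file."""
    with open(out_path, "w") as out:
        for path in sorted(glob.glob(pattern)):
            with open(path) as handle:
                for bed_line in realign_to_bed(handle, **kwargs):
                    out.write(bed_line + "\n")
```

Under these assumptions, `write_lane_track("s_3_????_realign.txt", "s3_customTrack.txt")` would produce a track for lane 3. Keeping the parsing in a generator makes it easy to check the output on a few lines before converting a whole lane.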
Document generated by Confluence on Jul 25, 2008 16:42