This page last changed on Apr 03, 2007 by maising.

The analysis pipeline uses an open, text-file-based architecture, so you have access to many different pieces of intermediate information. This page describes the pieces that you are most likely to want to look at.

Sequences and quality scores

At the most basic level, the pipeline provides a set of sequences and base-specific quality scores. The sequences are bundled by lane and come in one of several text formats; the choice of format is configurable by the user. The currently supported formats are:

  • fastA: Well known, but no quality scores.
  • fastQ: An adaptation of the fastA format that contains quality scores. However, our fastQ variant is not completely compatible with the fastQ files currently in existence, which various applications (for example, BioPerl) can read: because we use a larger dynamic range of quality scores, the scores are encoded in ASCII as 64+score instead of the standard 33+score. This avoids running into non-printable characters.
  • SCARF (Solexa compact ASCII read format): Another easy-to-parse, text-based format in which all the information for a single read is stored on one line.
    Quality scores can be stored either as symbolic ASCII values or as numeric values. The parameters that configure the output format are described in the GERALD User Guide and FAQ.
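
As a rough illustration of the 64+score encoding described for the fastQ variant above, the following Python sketch converts a quality string to integer scores and back. The function names are invented for this example and are not part of the pipeline; the default offset of 64 is the one stated above.

```python
# Offset stated on this page for the pipeline's fastQ variant (ASCII = 64 + score).
SOLEXA_OFFSET = 64


def decode_qualities(quality_string, offset=SOLEXA_OFFSET):
    """Turn an ASCII-encoded quality string into a list of integer scores."""
    return [ord(char) - offset for char in quality_string]


def encode_qualities(scores, offset=SOLEXA_OFFSET):
    """Turn a list of integer scores back into an ASCII-encoded string."""
    return "".join(chr(score + offset) for score in scores)


# decode_qualities("hhh@;") gives [40, 40, 40, 0, -5]
```

Passing a different offset to the same functions reads files that use another encoding, which may help when exchanging data with tools that expect the standard fastQ convention.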

The output files are called s_1_sequence.txt, s_2_sequence.txt, ... and can be found in the GERALD folder of a finished analysis run. More information can be found in "What do the different files in an analysis directory mean".

Alignments

If a reference file has been specified, the analysis can produce a set of "alignment files". These indicate the position of the best match to the reference (allowing for SNPs), an alignment score, a flag indicating whether the match is to the forward or reverse strand, and, if available, the alignment score of the next-best match. The alignment files have names of the form "s_1_0013_realign.txt"; there is one such file per tile on the flowcell. More information can be found in "What do the different files in an analysis directory mean".
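
When collecting the per-tile alignment files, it can be handy to split a file name into its parts. The Python sketch below assumes that names follow the pattern s_<lane>_<tile>_realign.txt with a four-digit tile number, as the example name suggests; the function name is invented for this illustration.

```python
import re

# Assumed naming pattern: s_<lane>_<four-digit tile>_realign.txt
REALIGN_NAME = re.compile(r"^s_(\d+)_(\d{4})_realign\.txt$")


def parse_realign_name(filename):
    """Return (lane, tile) as integers, or None if the name does not fit the pattern."""
    match = REALIGN_NAME.match(filename)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


# parse_realign_name("s_1_0013_realign.txt") gives (1, 13)
```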

SNPs and allele calling

The pipeline does not currently include an allele caller, but we expect to add one soon. For now, you have to parse the alignment files yourself and use the quality scores and the positions in the reference to generate your own SNP calls.

Assembly

We do not currently provide a de novo assembler. However, de novo assembly of short reads is an active field of development at a significant number of external groups. It appears that assembly of genomes several Mb in size from Solexa-type read pairs has been successfully simulated, and assembly of large contigs from real data has been demonstrated even without read pairs. We expect assembly programs suitable for Solexa data to become available in the near future.

Visualisation

Pipeline output

The GERALD part of the pipeline produces a series of diagnostic QC plots and summary tables. These are presented as HTML pages and can be found in the GERALD output folder. Again, more information can be found in "What do the different files in an analysis directory mean".

Gap4 and tgap

There is currently no genome browser included with the pipeline, but we are looking to provide suitable tools soon. Gap4 can now read Solexa data and display traces, but it is still being optimised for the large data volumes. James Bonfield and the Sanger Institute have kindly made their conversion scripts and programs available on the Sanger website. James also provides a utility called "tgap" (ftp://ftp.sanger.ac.uk/pub/PRODUCTION_SOFTWARE/src/tgap-0.3.tar.gz), which is a fast viewer for short-read data "provide[d] as a useful stop-gap measure while [James works] on improving gap4 for real". Furthermore, the Sanger Institute provides a sample Solexa data set that can be displayed in the viewer.

UCSC browser

Data that is uniquely aligned to a genome can be viewed as a custom track in the UCSC genome browser (viewable only from the machine from which it was uploaded). More on UCSC custom tracks can be found at http://genome.ucsc.edu/goldenPath/help/customTrack.html. To generate a track in the BED format from GERALD *_realign.txt files (say for lane 3, assuming 25 nt sequences):

cat s_3_????_realign.txt | grep -v '^#' | perl -ane 'if (@F>3 && /(chr.+):(\d+)\s[FR]/) {print "$1\t$2\t", $2+25, "\n"}' > s3_customTrack.txt
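
For readers who prefer a script to a one-liner, the same conversion can be sketched in Python. This is only an illustration: the function name and the read_length parameter are invented here, and the pattern is taken from the one-liner, with the strand class written as [FR] so that a literal "|" is not accepted. The coordinates are copied unchanged. BED start positions are zero-based, so whether the start needs shifting depends on how positions are numbered in the realign files, which this page does not state.

```python
import re

# Pattern from the one-liner: a reference name starting with "chr", a colon,
# the match position, one whitespace character, and the strand flag (F or R).
HIT = re.compile(r"(chr.+):(\d+)\s([FR])")


def realign_lines_to_bed(lines, read_length=25):
    """Yield tab-separated BED lines (name, start, end) for aligned reads."""
    for line in lines:
        if line.startswith("#"):
            continue  # comment line, as filtered by grep -v in the one-liner
        if len(line.split()) <= 3:
            continue  # too few fields to carry an alignment
        match = HIT.search(line)
        if match is None:
            continue  # no recognisable position on this line
        chrom = match.group(1)
        start = int(match.group(2))
        yield f"{chrom}\t{start}\t{start + read_length}"
```

To build a track, pass the lines of each s_3_????_realign.txt file to the function and write every returned line, followed by a newline, to the output file.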

Ensembl

As part of a UK LINK grant, plug-ins and import scripts to load and visualise Solexa data in Ensembl have been developed. These allow Solexa data to be incorporated in Ensembl views and add various extensions, such as the ability to view coverage tracks, stacked-up reads and SNPs, as well as intensity traces. However, this currently requires a full Ensembl installation at your site. If you are interested in this application, please get in touch with us; see also the Pipeline FAQ.

Document generated by Confluence on Apr 05, 2007 11:25