This page last changed on Aug 17, 2007 by maising.

Installation/setup

Our Perl/Python interpreters (or /usr/bin/env) are not at the path specified in your scripts' shebang lines.

We have tried to use /usr/bin/env in the scripts where possible, but if that does not suit your needs, setting this up properly is really something you have to figure out yourself. To quote O'Reilly's "Learning Perl":

Just say [to your system administrator]: "You know, I read in a book that both /usr/bin/perl and /usr/local/bin/perl should be symbolic links to the true Perl binary," and under the influence of your spell the admin will make everything work.
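As a sanity check, the /usr/bin/env approach can be demonstrated with a throwaway script; the /tmp path and script name below are only examples, but the same shebang pattern applies to the pipeline's Perl and Python scripts.

```shell
# Write a tiny script whose shebang locates the interpreter via env,
# then execute it; if this works, env-based shebangs work on your system.
cat > /tmp/envdemo <<'EOF'
#!/usr/bin/env sh
echo "interpreter found via env"
EOF
chmod +x /tmp/envdemo
/tmp/envdemo
```

If this fails, /usr/bin/env itself is missing or not at that path, and you will need the symbolic-link arrangement described above.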

What parallelisation/cluster solutions do you recommend?

Please ask Illumina for recommended hardware solutions.

In principle, any parallelised derivative of a recent version of GNU make should be able to run the pipeline on a cluster. However, your success will also depend on the version of make used:

  1. GNU make 3.76 does not support target-specific variables. We do not currently intend to support make 3.76.
  2. As far as we can see, GNU make 3.77 should provide a sufficient feature set to run the pipeline; unfortunately, it appears to be somewhat buggy in its support for the new features and we have not been able to run the whole pipeline successfully. Derivatives may fix some of the bugs.
  3. Any derivative of make 3.78 or later should be fine.
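You can check which version you have before attempting a cluster run; the version string in the comment is only an example.

```shell
# Report the installed GNU make version; per the list above, 3.78 or
# any later derivative should be sufficient for the pipeline.
if command -v make >/dev/null 2>&1; then
  make --version | head -1    # e.g. "GNU Make 3.81"
else
  echo "make not found on PATH"
fi
```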

For more information on parallelisation see Pipeline parallelisation.

We would be extremely grateful for feedback if you manage to run the pipeline under any other distributed make/cluster batch queueing system.

Goat/Bustard

I run the goat_pipeline.py/bustard.py script, but it does not actually produce any output in the Data folder!

Add the "--make" option.
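A sketch of typical usage, assuming goat_pipeline.py is on your PATH (the run-folder path below is a placeholder): without "--make" the script only reports what it would do; with "--make" it actually creates the analysis directories and makefiles under Data.

```shell
# Placeholder run-folder path; substitute your own.
RUNFOLDER=/data/070813_RUNFOLDER
# Dry run: reports what would be done, creates nothing under Data:
#   goat_pipeline.py "$RUNFOLDER"
# Real run: creates the analysis directories/makefiles under Data:
#   goat_pipeline.py --make "$RUNFOLDER"
echo "would analyse $RUNFOLDER"
```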

Gerald

The pipeline does not produce any graphical output!

Make sure that gnuplot and ImageMagick's convert are installed and on the path for the user/machine the pipeline runs on. Also make sure that Ghostscript is installed, and test that gnuplot can produce PostScript output.

Make sure that the unfortunately named "convert" utility from the Staden package does not come before the ImageMagick convert in your path.
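A quick way to check these prerequisites from the shell (only standard tools are involved; the ImageMagick directory in the comment is an example path):

```shell
# Report which of the required plotting tools are visible on the PATH;
# the first hit for each name is the one the pipeline will use.
for tool in gnuplot convert gs; do
  command -v "$tool" || echo "$tool: not found on PATH"
done
# If 'convert' resolves to the Staden utility, put ImageMagick's bin
# directory earlier in the PATH, e.g.:
#   export PATH=/usr/local/imagemagick/bin:$PATH
```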

What should the contaminant file look like?

The contaminant file makes it possible to calculate a second set of alignments against a limited set of known contaminant sequences. These are given to the pipeline in "tag" or "monotemplate" mode; i.e., reads are aligned directly against the tags, starting from the first base, and no attempt is made to find an optimal alignment position.

If you want to remove contamination from a genomic sample with a large number of sequences at different base positions, the recommended solution is to create a combined FASTA file containing the sequence of your sample and the suspected contaminant genome, separated by a run of "padding" bases ("N"). The number of padding bases should be equal to or greater than the experimental read length, to avoid spurious alignments across the boundary between the two genomes.
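The construction above can be sketched in the shell; the sequences, output file name and the 36-base read length are placeholders.

```shell
READLEN=36                                  # example read length
PAD=$(printf 'N%.0s' $(seq 1 "$READLEN"))   # a run of READLEN Ns
{
  printf '>sample_plus_contaminant\n'
  printf 'ACGTACGTACGTACGT\n'               # placeholder sample sequence
  printf '%s\n' "$PAD"                      # padding blocks cross-boundary hits
  printf 'TTGGCCAATTGGCCAA\n'               # placeholder contaminant sequence
} > /tmp/combined.fa
```

The result is a single FASTA record in which no read-length window can span both genomes without crossing the padding.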

Integration with other tools

How can I submit Genome Analyzer data to a trace archive?

There is currently an active discussion between various archives, research institutes and the commercial companies developing short-read technologies about standardising short-read data formats. This effort is currently organised from British Columbia, and there is an email list, ssrformat@bcgsc.ca; you can subscribe at http://www.bcgsc.ca/mailman/listinfo/ssrformat. We expect that formats will soon be in place to allow submissions to public archives.

How can I load Genome Analyzer data into Ensembl?

Integration with Ensembl is under active development at Illumina UK as part of a UK Department of Trade and Industry and Medical Research Council LINK grant to support a collaborative project with the European Bioinformatics Institute, Imperial College London and the Wellcome Trust Sanger Institute. A working version is running at Illumina; it requires modifications to Ensembl release 39 and a separate database. All of this work will become publicly available, but there is no official release date yet; if you are interested, please contact us.

How can I get the pipeline output into gap4?

The latest version of gap4 has support for Genome Analyzer traces, but it struggles to handle large volumes of short-read data. A stop-gap version called "tgap", which can be used while gap5 is being developed, can be obtained from the website of the Sanger Institute - see Pipeline output and visualisation. The site also provides conversion utilities for Genome Analyzer data.

How can I import pipeline output into my database/favourite viewer/analysis package?

We do not currently have any explicit export functionality for packages other than gap4 or Ensembl. However, by keeping all the final and intermediate analysis output in flat-file formats and XML, we hope to make it easy for others to integrate the output into their own infrastructure. Future versions of the pipeline will most likely make use of database engines for the storage of analysis information.

General

What do all the files in the run-folder mean?

Most of the important files are described - at least at a basic level - in the Run-folder specification and in What do the different files in an analysis directory mean. If those pages do not contain enough information, please contact us directly. Note that some output files are undocumented, either because they are intermediate files for the software's own use or because their format is not yet finalised and may change in future versions. We will try to provide more documentation in future releases. Last but not least, you have access to the source code that produces every output file, which may give some insight into what the numbers in a file mean.

Who do I contact with questions regarding the pipeline?

We have an email address "techsupport@illumina.com" for any support questions.

Document generated by Confluence on Jan 11, 2008 15:41