Data analysis - documentation : Whole genome alignments using ELAND
This page last changed on Mar 05, 2008 by ac.
Copyright (c) Illumina Inc. 2007. All rights reserved. Author: Anthony J. Cox PurposeThis page contains some notes how to do large scale alignments using ELAND. ("the software formerly known as IMPALA") ELAND stands for E fficient L arge-Scale A lignment of N ucleotide D atabases. ELAND searches a set of large DNA files for a large number of short DNA reads allowing up to 2 errors per match. Here CompilationEland gets compiled automatically as part of the pipeline compilation. All necessary source files now live in the Eland subfolder of the Pipeline CVS tree, with the exception of GlobalUtilities.h and GlobalUtilities.cpp (both also required by PhageAlign), which live in subfolder Misc. Each read in your query set of reads must be of the same length and ELAND must be compiled to match. If you wanted a manual compilation, to do this go to Pipeline/Eland and type e.g. make eland_20 for reads of length 20 - eland_20 is the executable that you should run. Alternatively, set OLIGO_LENGTH in your environment to 20 and do make -e eland - in this case your executable will be called just eland. Preparing the reference genomeEach read in your query set of reads must be of the same length. These reads are aligned to a target set of large DNA sequences. These sequences must be in Fasta format. An example: >X dna:chromosome chromosome:NCBI36:X:1:154913754:1 CTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTCTGAAAGTGGACCTATCAGCAG GATGTGGGTGGGAGCAGATTAGAGAATAAAAGCAGACTGCCTGAGCCAGCAGTGGCAACC CAATGGGGTCCCTTTCCATACTGTGGAAGCTTCGTTCTTTCACTCTTTGCAATAAATCTT They must first be "squashed" into the 2-bits-per-base format that ELAND reads. This is done by
At the end of this procedure, your directory path/myGenome will look similar to this: > ls path/myGenome fastaFile1.fa.2bpb fastaFile1.fa.vld Until pipeline version 0.3, one was restricted to a single entry per fasta file for the target sequences. For pipeline version 0.3 this restriction has been removed. If fastaFile1 contained multiple entries then an additional file fastaFile1.fa.idx will be present in path/myGenome. For speed you may want to make room for the squashed files on your machine's local hard drive rather than keeping them on an external analysis drive. As an example, the squashed files for the 3Gbase human genome will take up 750Mbytes of space (750 Mbytes x 4 bases per byte = 3 Gbases). Limitations on the size of squashed genome files
The design goal of ELAND was to match a large number of reads against the human genome, ie a small number of large pieces of DNA. The key to doing this efficiently is to fit as large a batch of reads into RAM as possible. Accordingly it stores a match position (chromosome number plus position in chromosome) in 4 bytes, which in turn limits the total length of DNA that ELAND can match against in a single run. The most significant of the 4 bytes is used as a block number, and can range between 1 and 239 (codes 0 and 240 to 255 are used to denote repeats). The remaining 3 bytes are used to denote position in a block, which can therefore be up to 2^24=16Mb in size. Each file must take up a whole number of blocks.
Running ELAND as a standalone program
The eland_extended and eland_pair modes introduced in Pipeline version 0.3 share a common 'export' file format. The intention of this file is to combine all the important information for a lane into one file (or two files in the case of a paired read run). These scripts:
There seems to be an increasingly common requirement to run this set of scripts in a 'standalone' fashion. This can be done via the script Pipeline/Eland/ELAND_standalone.pl. Minimal usage: ./Pipeline/Eland/ELAND_standalone.pl -if read1.fastq -if read2.fastq -it fastq -eg /lustre/data01/Mondas_software/Genomes/E_coli_ELAND Full list of options:
NB: for a paired read analysis, both reads must share same TBDs:
ELAND_standalone is meant for:
If you want to e.g. produce a set of export files for submission then it is probably best to re-run GERALD part of the pipeline to get a 'proper' analysis. Pitfall: A quick and dirty conversion from 'rest of the world' to 'us' can be done by cat file.fastq | perl -lane '{ if ($l=~/^\+/) { s/(.)/chr(ord($1)+32)/eg;} print; $l=$_ }' > new.fastq I propose to make ELAND_standalone.pl part of the 0.4 release as several of our informed external users have asked for it and it can't break the existing pipeline code to do so. Running the ELAND executable
The command line syntax to run ELAND is eland_executable queryFile.txt squashedGenomeDir outFile.txt [repeatFile.txt]
For historical reasons a slightly different file format is used if an output name ending in .vmf is specified.
Running ELAND from the pipeline using GERALDThis requires revision v1.12 or later of GERALD.pl in Pipeline/Gerald. For more information on Gerald see Gerald User Guide and FAQ. The config file you run GERALD on should contain lines such as ELAND_GENOME /usr/local/share/eland/ncbi35 (or wherever your directory of squashed genome files is held) ELAND_REPEAT /usr/local/share/eland/reps30_5 (or wherever your list of repeats is held) ANALYSIS eland or use 34:ANALYSIS eland to run ELAND on particular lanes. Note that ELAND_GENOME referes to a directory, not a file. The usual Gerald variables GENOME_DIR and GENOME_FILE are not used for Eland analysis. As described above, Eland expects a different file format and not Fasta. You should then be able to run make. Under the hood a script convertToFasta.pl converts and concatenates all the reads into a single large FASTA file which is then passed as an input to ELAND. A script convertFromELAND.pl converts the results back into the by-tile PhageAlign format expected by the rest of the pipeline.
Eland uses up to 1GB of memory. The pipeline starts one Eland job per lane. In order to prevent most computers from running out of memory, an artificial dependency in the GERALD Makefile prevents multiple instances of Eland from running at the same time. This limitation can be removed by using the option ELAND_MULTIPLE_INSTANCES 8 in the Gerald config file. Be aware that this may use up to 8 GB of memory on your analysis computer; if not enough memory is available, the analysis is likely to crash. Allowed values for this option are 1, 2, 4, 8. (1 means no lane parallelisation, and uses up to 1 GB RAM, 2 means two parallel jobs and up to 2 GB etc.) ELAND features, "features" and TBDs
Appendix: How ELAND handles missing bases
First note that missing bases need to be specified as "N" characters and not "." as in the sequence files. This conversion is handled automatically by GERALD but you need to be aware of it when running ELAND as a standalone program. ELAND allows up to (read length/4) "external" Ns - i.e at the beginning or end of the read. These are ignored. ELAND also allows up to 2 "internal" Ns - these are interpreted in one of two ways, which are given equal weight (i.e. if a read with one type 'D' match and one deletion-N match will be called as a repeat). A 'type D' N is of the form: Read: ACNGT Genome: ACCGT i.e. the base is there but not detected. A 'type I' N is of the form: Read: ACNGT Genome: AC-GT i.e. it has skipped a base. 'Type D' and 'type I' Ns are given equal weight In lining up bases in your read with bases at the position it aligns to in the reference, you should I do have a Perl script that correctly interprets the rules described above, but given that Ns shouldn't occur that often it might be simpler just to ignore N-containing reads. For ancient historical reasons 'D' stands for 'detection error' and 'I' stands for 'incorporation error' - I realise this is confusing as the temptation is to interpret them as 'deletion' and 'insertion.' I spent considerable time trying to get the N handling to work as described above, but the way it handles Ns is based on very old assumptions about how the (then non existent!) machine would behave - at the time we were intending to sequence single molecules for which I assumed N errors would be a lot more common and would be likely to be of the type described. Most Ns we actually see in practice are due to clusters at the edge of tiles wandering off the edge of the image for a cycle or two due to imperfect re-mapping of the tile position at different cycles - hence giving a 'type D' error. Otherwise the pipeline software strongly prefers to make a base call if at all possible, even if the call is of low quality. A low priority TBD is to allow the N handler to be switched to 'type D errors only' (which better reflects the way the platform behaves) or just restricted to external Ns. |
![]() |
Document generated by Confluence on Jul 25, 2008 16:42 |