Junction Alignments to Genome for RNA-seq Reads
Documentation
JAGuaR is an
alignment protocol for paired end RNA-seq reads based on an extended
reference. It uses BWA to align reads to
the genome and reference transcript models (including annotated exon-exon
junctions) specifically allowing the possibility for a single read to span
multiple exons. Reads aligned to the special reference repository are then
“repositioned” on to genomic coordinates, transforming reads spanning multiple
exons into large-gapped alignments.
Availability:
JAGuaR is implemented in Python, and is freely available for download at
http://www.bcgsc.ca/platform/bioinfo/software/jaguar
JAGuaR
creates a genome + junctions reference to be used for different read length
alignments. JAGuaR has been tested with
alignments using BWA (http://bio-bwa.sourceforge.net/). After reads are aligned to this reference
with BWA, convertJunctions converts the junction aligned reads to genomic
coordinates.
A) Pipeline
1. Create tab delimited transcript annotation file
in format which is used to create the junction reference where exon start and
end is 1-based inclusive:
transcript_id chr exon_start
exon_end exon_rank strand
This format can be easily generated using BioMart by choosing ‘structures’ and then clicking the appropriate categories in the GENE and EXON section in the order you want them to appear from left to right. http://www.biomart.org/biomart/martview/
See Section C for an example of a
formatted annotation file.
2. Run createJunctions
Creates the exon-exon junction
sequences for each chromosome and an index with the stub chrX_JN. Concatenates the files created in step2, with
the reference genome fasta (i.e. hg19.fa) file to create the genome + junction
reference sequence (i.e ref.fa) for alignment.
This has to be run once for
each read size such as 50bp, 75bp, 100bp to create ref.fa.
python
createJunctions.py -c
junction.cfg
The reference file containing exon-exon junction sequence needs to be
indexed before alignment.
bwa index -a bwtsw ref.fa
See Section B for an example
configuration file.
Note:
The chromosome naming used in the transcript annotation file should
match the naming used in the genome fasta. If the annotation includes chromosomes not
included in the genomic fasta, a message of the form "INFO Chromosome
<$chrName> is not included in the genomic fasta" will appear in the
createJunctions.log file. In addition
the version of BWA used to index the reference should be the same version used
for the initial alignment.
3. Alignment
Alignments of reads to the JAGuaR reference has
only been tested with BWA in the following pipeline:
bwa index -a bwtsw ref.fa [only needs to be run once/genome]
bwa aln ref.fa read1.fastq >
read1.sai
bwa aln ref.fa read2.fastq >
read2.sai
bwa sampe -s ref.fa read1.sai
read2.sai read1.fastq read2.fastq > aln.sam
JAGuaR has been tested with the
BWA-MEM algorithm and while some development is required for JAGuaR to make full
use of the BWA-MEM algorithm, running in its current version gives similar
results for BWA but overall runtime is much faster.
bwa-0.7.4/bwa mem -t 8 ref.fa
read1.fastq read2.fastq > aln.sam
For runs with BWA-MEM only, a post alignment script provided in
the JAGuaR package needs to be run before convertJunctions. This chooses the best read when a read is
multi-mapped or split.
python samFilterSelectReads.py aln.sam
> aln.filtered.sam
4. Run convertJunctions to create a repositioned sam file that is sorted by read name:
python
convertJunctions.py --samtools samtools –f aln.sam –o myaln -m 34622 -i ref100/
-s
To create a repositioned bam file sorted by coordinate and indexed:
python
convertJunctions.py --samtools samtools –f aln.sam –o myaln -m 34622 -i ref100/
-s -b -g GRCh37-lite.fa
5. Use samtools to create bam file (running
convertJunctions with -b and -g does
this step automatically):
samtools
faidx hg19.fa [index reference genome if not already done so]
samtools
view –Sb -T hg19.fa
myaln.repos.sam > myaln.repos.bam
samtools sort myaln.repos.bam
myaln.repos.sorted
samtools index
myaln.sorted.repos.bam [not needed for SNP calling but for IGV]
B)
CreateJunctions configuration file example
Junction.cfg
[GENOME_REF]
species=homo_sapiens
genome=hg19a
genome_fasta=GRCh37-lite.fa
transcript_file=ens61Genes.txt
[RUN_TIME]
out_dir=./results
seq_length=100
[HEADER_ANNOTATION]
SQ_header_AS=NCBI-Build-37
SQ_header_UR=GRCh37-lite.fa
SQ_header_SP=Homo sapiens
This
configuration file is required by createJunctions to setup the reference that
the RNA-seq reads are aligned to. Under
GENOME_REF, information on the species and genome name is required along with
the name and path of the genome reference
to be used and the transcriptome annotation file. Under RUN_TIME, list the output directory
that will hold the reference and indicate the size of the reads that will be
used. Header information is listed under
HEADER_ANNOTATION.
C)
Formatted transcriptome annotation
example
A tab delimited transcript text
file (referred to above as transcript_file) is required to create the junction
reference. Its format currently must be one line per exon for all transcripts
in the form of Transcript ID, Chromosome Name (with or without
"chr"), Exon Start Coordinate, Exon End Coordinate, Exon Rank in
Transcript, Strand) as per below. This
can easily be created at http://www.ensembl.org/biomart/ by selecting "Ensembl Genes", the genome dataset of
interest, "Attributes" and then "Structures" . Then choose in this order: Ensembl Transcript
ID, Chromosome Name, Exon Chr Start (bp), Exon Chr End (bp), Exon Rank in Transcript, Strand. Export all results to a TSV (tab separated
values) file. When read, the first
header line will be ignored.
Ensembl Transcript ID Chromosome Name Exon Chr Start (bp) Exon Chr End (bp) Exon Rank in Transcript Strand
ENST00000416909
13 96185952 96186164 1
-1
ENST00000416909
13 96180746 96180841 2
-1
ENST00000416909
13 96149313 96149379 3
-1
ENST00000416909
13 96131698 96132216 4
-1
ENST00000439928
13 24553949 24554063 1
1
ENST00000439928
13 24557866 24558072 2
1
ENST00000439928
13 24591806 24591840 3
1
ENST00000439928
13 24607652 24607908 4
1
ENST00000439928
13 24608060 24609166 5
1
D) ConvertJunctions input parameters
Input: SAM or BAM file of paired end tag (PET) reads
aligned to Jaguar reference/junctions with BWA (sampe -s) and sorted by name
Output: SAM file with genomic coordinates with
appropriate reads split by exon
Usage: convertJunctions.py [options]
Options:
-h,
--help show this help message
and exit
-f file PET sorted .sam or .bam (f)ile.
REQUIRED
-o o (O)utput stub for repositioned
file incl. directory,
otherwise output in cwd;
i.e., [Output tag].repos.sam.
REQUIRED
-i
index (I)ndex folder.
REQUIRED
-m
max_separation (M)ax separation. (M)ax
separation
recommendations: 34622
is used for hg19a, 23545 is used
for hg18 REQUIRED
-g
genome (G)enome reference
fasta file with path. Required if
using option -b
--samtools=SAM_EXEC SAMTOOLS executable
-b create (B)am after sam conversion
-c
check chromosome index check,
default=1
-d print debugging messages to
screen
-s create cigar.txt (CIGAR
errors) and .checks (FLAG
errors) files
-x
chromosome filter by comma
delimited list of chromosomes, default
all chromosomes listed
in genome length file
-l l Number (L)ines/reads to
reposition from beginning of -f
file
-r r pull out read(s) and create
reads.sam
-n NO to repositioning, assumes
a repositioned file
already exists
-e Create bed file of exon-exon
junction covered read
Coordinates
--do_not_check_mito_naming
if not specified,
mitochondrial chromosome M (chrM)
will be renamed MT
(chrMT)
--do_not_drop_chr_prefix
If not specified, the
'c-h-r' prefix of any chromosome
name will be dropped
(chr8 becomes 8)
E)
Example Output after Repositioning
JAGuaR Summary
-----------------------------------------------------------------------------
Start: Wed Mar
6 12:57:48 2014
Finish: Wed Mar 6 18:55:50 2014
File analyzed: testrun.bam
Output file with repositioned reads:
testrun.repos.sam
Reference Index Folder: /ref/
Repositioning time: 357.450638203
minutes
Max Separation: 34622
Total number of reads: 246197358
Total number of reads aligned to
junctions: 38736275
Junction reads already unmapped in
first alignment: 2072703
Junction reads set unmapped after
repositioning: 3304
Junction reads mapped and
repositioned with large gaps: 36432882
Junction reads repositioned without
large gaps: 227940
Number of times one read covers 1,
2, 3 or more junctions:
1 33254992
2 3124861
3 52116
4 913
Total number of times an exon-exon
junction is covered: 39664714
Total number of exon-exon junctions
covered by at least one read: 209518
--
F) Definition of TLEN implemented in JAGuaR
This program conforms to the Picard spec for TLEN (insert size).
TLEN = distance between 5’ ends
TLEN = my mate's start - my start
If a read is +, its start is its POS. If a read is - (ie. it has the 16 bit), its start is POS + the genomic distance the read covers.
For example, if both reads in a pair are -, the TLEN will be calculated as the distance between the rightmost base of read1 and the rightmost base of read2.
If the reads are a normal forward-reverse proper pair, the TLEN will be calculated as the distance between the leftmost base of read1 and the rightmost base of read2
G) Example Set
A small sample set is available for download and testing:
http://www.bcgsc.ca/downloads/JAGuaR/example
PET fastqs: seq1.fastq.gz, seq2.fastq.gz
Configuration file: junction.cfg
Transcript annotation file: GRCh37p13.txt
Genome reference file: GRCh37-lite.fa.gz
Final bam after JAGuaR run: bwa_mem_aln.bam
1) Download and unzip fastqs and genome fasta file.
2) Download junction.cfg configuration file and example transcript annotation file.
2) Make sure junction.cfg points to correct path of genome reference and transcript file.
3) Run createJunctions and index the resulting ref.fa reference as per step 2 in "Pipeline".
4) Align PET fastqs with BWA (or BWA-MEM) to ref.ca (generated above) as per step 3 in "Pipeline".
5) Run convertJunctions as per step 4 in "Pipeline".
6) Resulting file should be similar to bwa_mem_aln.bam (also available for download).
Comments, questions and concerns can be sent to:
ybutterf@bcgsc.ca
May, 2014