Theory and Models¶
Structural Variants in this tool are defined as a pair of breakpoints. A breakpoint is a genomic position (interval) on some reference/template/chromosome which has a strand and orientation. The orientation describes the portion of the reference that is retained.
Types of Flanking evidence¶
One of the most confusing parts about working with contig and paired-end reads is relating them to the breakpoint so that you can determine which types will support an event. The flanking read types we outline here are similarly described by IGV. We have used similar coloring for the read pairs in the following diagrams to facilitate ease of use for those already familiar with viewing bam files in IGV.
Note
The major assumptions here are that the ‘normal’ read-pair is a read pair which has one read on the positive/forward strand and its partner on the negative/reverse strand. It is assumed that partners share a read name, as is the case for Illumina reads.
Deletion¶
For a deletion, we expect the flanking reads to be in the normal orientation but that the fragment size should be abnormal (for large deletions).
Insertion¶
Duplication¶
Calculating the Evidence Window¶
We make some base assumptions with regards to paired-end read data:
Note
the distribution of fragment sizes approximately follows a normal distribution
Note
the most common fragment size is the unmutated ‘normal’ fragment
With the above assumptions we take the median fragment size to be the expected normal.
Given that we expect mutations and therefore abnormal fragment sizes, we use a modified method to calculate the median standard deviation (see code below). We calculate the squared distance away from the median for each fragment and then take a fraction of this to be ‘normal’ variation. So the most abnormal portion is ignored, assuming it is supposed to be abnormal. This results in a calculation as follows.
from statistics import median
import math
fragments = [abs(read.template_length) for read in reads] # the fragment sizes of the reads
f = 0.95 # fraction
m = median(fragments) # get the median fragment size value
X = [math.pow(i - m, 2) for i in fragments] # take the square error for each point
end = int(round(len(X) * f))
X = sorted(X)[0:end]
stdev = math.sqrt(sum(X) / len(X))
This gives us an idea of when to judge an fragment size as abnormal and where we expect our normal read pairs fragment sizes to fall.
As we can see from the diagram above, removing the outliers reproduces the observed distribution better than using all data points
We use this in two ways
- to find flanking evidence supporting deletions and insertions
- to estimate the window size for where we will need to read from the bam when looking for evidence for a given event
The _generate_window()
function uses the above concepts.
The user will define the median_fragment_size
the stdev_fragment_size
, and the stdev_count_abnormal
parameters defined in the VALIDATION_DEFAULTS
class.
If the library has a transcriptome protocol this becomes a bit more complicated and we must take into account the
possible annotations when calculating the evidence window. see
_generate_window()
for more
Classifying Events¶
The following decision tree is used in classifying events based on their breakpoints. Only valid combinations have
been shown. see classify()
Assembling Contigs¶
During validation, for each breakpoint pair, we attempt to assemble a contig to represent the sequence across the breakpoints. This is assembled from the supporting reads (split read, half-mapped read, flanking read pair, and spanning read) which have already been collected for the given event. The sequence from each read and its reverse complement are assembled into contigs using a DeBruijn graph. For strand specific events, we then attempt to resolve the sequence strand of the contig.
Breakpoints can be called by multiple different CALL_METHOD
.
Calling Breakpoints by Flanking Evidence¶
Breakpoints are called by contig, split-read, or flanking pairs evidence. Contigs and split reads are used to call exact breakpoints, where breakpoints called by flanking reads are generally assigned a probabilistic range.
The metrics used here are similar to those used in calculating the evidence window. We use the max_expected_fragment_size as the outer limit of how large the range can be. This is further refined taking into account the range spanned by the flanking read pair evidence and the position of the opposing breakpoint.
Annotating Events¶
We make the following assumptions when determining the annotations for each event
Note
If both breakpoints are in the same gene, they must also be in the same transcript
Note
If the breakpoint intervals overlap we do not annotate encompassed genes
Note
Encompassed and ‘nearest’ genes are reported without respect to strand
There are specific questions we want annotation to answer. We collect gene level annotations which describes things like what gene is near the breakpoint (useful in the case of a potential promoter swap); what genes (besides the one selected) also overlap the breakpoint; what genes are encompassed between the breakpoints (for example in a deletion event the genes that would be deleted).
Next there are the fusion-product level annotations. If the event result in a fusion transcript, the sequence of the fusion transcript is computed. This is translated to a putative amino acid sequence from which protein metrics such as the possible ORFs and domain sequences can be computed.
Predicting Splicing Patterns¶
After the events have been called and an annotation has been attached, we often want to predict information about the putative fusion protein, which may be a product. In some cases, when a fusion transcript disrupts a splice-site, it is not clear what the processed fusion transcript may be. MAVIS will calculate all possibilities according to the following model.
For a given list of non-abrogated splice sites (listed 5’ to 3’ on the strand of the transcript) donor splice sites are paired with all following as seen below
More complex examples are drawn below. There are five classifications (SPLICE_TYPE
) for the different splicing patterns:
- Retained intron (
RETAIN
) - Skipped exon (
SKIP
) - Multiple retained introns (
MULTI_RETAIN
) - Multiple skipped exons (
MULTI_SKIP
) - Some combination of retained introns and skipped exons (
COMPLEX
)
Pairing Similar Events¶
After breakpoints have been called and annotated we often need to see if the same event was found in different samples. To do this we will need to compare events between genome and transcriptome libraries. For this, the following model is proposed. To compare events between different protocol (genome vs transcriptome) we use the annotation overlying the genome breakpoint and the splicing model we defined above to predict where we would expect to find the transcriptomic breakpoints. This gives rise to the following basic cases.
Note
In all cases the predicted breakpoint is either the same as the genomic breakpoint, or it is the same as the nearest retained donor/acceptor to the breakpoint.
Glossary¶
- 2bit
- see UCSC
- bam
- see UCSC
- bed
- see UCSC
- blat
- Alignment tool. see https://genome.ucsc.edu/FAQ/FAQblat.html
- breakpoint
- A breakpoint is a genomic position (interval) on some reference/template/chromosome which has a strand and orientation. The orientation describes the portion of the reference that is retained.
- breakpoint pair
- structural variant which has not been classified/given a type
- event
- used interchangeably with structural variant
- event type
- classification for a structural variant. see event_type
- fasta
- see UCSC
- flanking read pair
- a pair of reads where one read maps to one side of a set of breakpoints and its mate maps to the other
- half-mapped read
- a read whose mate is unaligned. Generally this refers to reads in the evidence stage that are mapped next to a breakpoint.
- IGV
- Visualization tool. see http://software.broadinstitute.org/software/igv/
- IGV batch file
- This is a file format type defined by IGV see running IGV with a batch file
- JSON
- JSON (JavaScript Object Notation) is a data file format. see w3 schools
- psl
- see UCSC
- pslx
- extended format of a psl
- spanning read
- applies primarily to small structural variants. Reads which span both breakpoints
- split read
- a read which aligns next to a breakpoint and is softclipped at one or more sides
- structural variant
- a genomic alteration that can be described by a pair of breakpoints and an event type. The two breakpoints represent regions in the genome that are broken apart and reattached together.
- SVG
- SVG (Scalable vector graph) is an image format. see w3 schools