constants module¶

Module providing small utility functions and constants used throughout the mavis package

mavis.constants.CALL_METHOD = <vocab.vocab.Vocab object>¶

Vocab – holds controlled vocabulary for allowed call methods

CONTIG: a contig was assembled and aligned across the breakpoints
SPLIT: the event was called by split read
FLANK: the event was called by flanking read pair
SPAN: the event was called by spanning read

mavis.constants.CIGAR = <vocab.vocab.Vocab object>¶

Vocab – Enum-like. For readable cigar values

M: alignment match (can be a sequence match or mismatch)
I: insertion to the reference
D: deletion from the reference
N: skipped region from the reference
S: soft clipping (clipped sequences present in SEQ)
H: hard clipping (clipped sequences NOT present in SEQ)
P: padding (silent deletion from padded reference)
EQ: sequence match (=)
X: sequence mismatch

note: descriptions are taken from the samfile documentation
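As an illustration of how these operations are typically used, the sketch below computes how many reference and read bases an alignment covers. The `Cigar` class is a hypothetical stand-in for the Vocab object, not the package's own definition; its integer codes follow the BAM encoding order `MIDNSHP=X`, and the helper names are assumptions made for this example.

```python
from enum import IntEnum


class Cigar(IntEnum):
    """Hypothetical stand-in for the CIGAR vocabulary (BAM codes, 'MIDNSHP=X')."""
    M = 0
    I = 1
    D = 2
    N = 3
    S = 4
    H = 5
    P = 6
    EQ = 7
    X = 8


# operations that advance the position on the reference
REFERENCE_CONSUMING = {Cigar.M, Cigar.D, Cigar.N, Cigar.EQ, Cigar.X}
# operations that advance the position in the read sequence (SEQ)
QUERY_CONSUMING = {Cigar.M, Cigar.I, Cigar.S, Cigar.EQ, Cigar.X}


def reference_length(cigar_tuples):
    """Sum the lengths of the operations that consume reference bases."""
    return sum(length for op, length in cigar_tuples if op in REFERENCE_CONSUMING)


def query_length(cigar_tuples):
    """Sum the lengths of the operations that consume read bases."""
    return sum(length for op, length in cigar_tuples if op in QUERY_CONSUMING)


# (operation, length) pairs, the same shape pysam uses for cigartuples
example = [(Cigar.S, 5), (Cigar.EQ, 20), (Cigar.D, 3), (Cigar.X, 1), (Cigar.I, 2), (Cigar.M, 10)]
ref_len = reference_length(example)  # 20 + 3 + 1 + 10
qry_len = query_length(example)      # 5 + 20 + 1 + 2 + 10
```

Because `IntEnum` members compare equal to plain integers, the helpers also accept tuples that carry raw integer codes.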

mavis.constants.CODON_SIZE = 3¶: int – the number of bases making up a codon

mavis.constants.COLUMNS = <vocab.vocab.Vocab object>¶

Vocab – Column names for i/o files used throughout the pipeline

annotation_figure: File path to the svg drawing representing the annotation
annotation_figure_legend: JSON - JSON data for the figure legend
annotation_id: Identifier for the annotation step
break1_call_method: CALL_METHOD - The method used to call the first breakpoint
break1_chromosome: str - The name of the chromosome on which breakpoint 1 is situated
break1_ewindow: Window where evidence was gathered for the first breakpoint
break1_ewindow_count: int - Number of reads processed/looked-at in the first evidence window
break1_ewindow_practical_coverage: float - break1_ewindow_count / len(break1_ewindow). Not the actual coverage, as bins are sampled within the window and there is a read limit cutoff
break1_homologous_seq: str - Sequence in common at the first breakpoint and other side of the second breakpoint
break1_orientation: ORIENT - The side of the breakpoint wrt the positive/forward strand that is retained.
break1_position_end: int - End integer inclusive 1-based of the range representing breakpoint 1
break1_position_start: int - Start integer inclusive 1-based of the range representing breakpoint 1
break1_seq: str - The sequence up to and including the breakpoint. Always given wrt the positive/forward strand
break1_split_reads: int - Number of split reads that call the exact breakpoint given
break1_split_reads_forced: int - Number of split reads which were aligned to the opposite breakpoint window using a targeted alignment
break1_strand: STRAND - The strand wrt the reference positive/forward strand at this breakpoint.
break2_call_method: CALL_METHOD - The method used to call the second breakpoint
break2_chromosome: str - The name of the chromosome on which breakpoint 2 is situated
break2_ewindow: Window where evidence was gathered for the second breakpoint
break2_ewindow_count: int - Number of reads processed/looked-at in the second evidence window
break2_ewindow_practical_coverage: float - break2_ewindow_count / len(break2_ewindow). Not the actual coverage, as bins are sampled within the window and there is a read limit cutoff
break2_homologous_seq: str - Sequence in common at the second breakpoint and other side of the first breakpoint
break2_orientation: ORIENT - The side of the breakpoint wrt the positive/forward strand that is retained.
break2_position_end: int - End integer inclusive 1-based of the range representing breakpoint 2
break2_position_start: int - Start integer inclusive 1-based of the range representing breakpoint 2
break2_seq: str - The sequence up to and including the breakpoint. Always given wrt the positive/forward strand
break2_split_reads: int - Number of split reads that call the exact breakpoint given
break2_split_reads_forced: int - Number of split reads which were aligned to the opposite breakpoint window using a targeted alignment
break2_strand: STRAND - The strand wrt the reference positive/forward strand at this breakpoint.
cluster_id: Identifier for the merging/clustering step
cluster_size: The number of breakpoint pair calls that were grouped in creating the cluster
contig_alignment_cigar: The cigar string(s) representing the contig alignment. Semi-colon delimited
contig_alignment_query_name: The query name for the contig alignment. Should match the ‘read’ name(s) in the .contigs.bam output file
contig_alignment_reference_start: The reference start(s) <chr>:<position> of the contig alignment. Semi-colon delimited
contig_alignment_score: float - A rank of the alignment being used, based on the scores reported by the alignment tool (blat, etc.). An average if split alignments were used. Lower numbers indicate a better alignment; the best possible alignment has a rank of zero.
contig_build_score: int - Score representing the edge weights of all edges used in building the sequence
contig_remap_score: float - Score representing the number of sequences from the set of sequences given to the assembly algorithm that were aligned to the resulting contig with an acceptable scoring based on user-set thresholds. For any sequence its contribution to the score is divided by the number of mappings to give less weight to multimaps
contig_remapped_read_names: read query names for the reads that were remapped. A -1 or -2 has been appended to the end of the name to indicate if this is the first or second read in the pair
contig_remapped_reads: int - the number of reads from the input bam that map to the assembled contig
contig_seq: str - Sequence of the current contig, given wrt the positive/forward strand if not strand specific
contig_strand_specific: bool - A flag to indicate if it was possible to resolve the strand for this contig
contigs_aligned: int - Number of contigs that were able to align
contigs_assembled: int - Number of contigs that were built from split read sequences
event_type: SVTYPE - The classification of the event
flanking_median_fragment_size: int - The median fragment size of the flanking reads being used as evidence
flanking_pairs: int - Number of read-pairs where one read aligns to the first breakpoint window and the second read aligns to the other. The count here is based on the number of unique query names
flanking_pairs_compatible: int - Number of flanking pairs of a compatible orientation type. This applies to insertions and duplications. Flanking pairs supporting an insertion will be compatible to a duplication and flanking pairs supporting a duplication will be compatible to an insertion (possibly indicating an internal translocation)
flanking_stdev_fragment_size: float - The standard deviation in fragment size of the flanking reads being used as evidence
fusion_cdna_coding_end: Position wrt the 5’ end of the fusion transcript where coding ends (last base of the stop codon)
fusion_cdna_coding_start: Position wrt the 5’ end of the fusion transcript where coding begins (first base of the Met amino acid)
fusion_mapped_domains: JSON - List of domains in JSON format where each domain start and end positions are given wrt the fusion transcript and the mapping quality is the number of matching amino acid positions over the total number of amino acids. The sequence is the amino acid sequence of the domain on the reference/original transcript
fusion_sequence_fasta_file: Path to the corresponding fasta output file
fusion_sequence_fasta_id: The sequence identifier for the cdna sequence output fasta file
fusion_splicing_pattern: SPLICE_TYPE - Type of splicing pattern used to create the fusion cDNA.
gene1: Gene for the current annotation at the first breakpoint
gene1_aliases: Other gene names associated with the current annotation at the first breakpoint
gene1_direction: PRIME - The direction/prime of the gene
gene2: Gene for the current annotation at the second breakpoint
gene2_aliases: Other gene names associated with the current annotation at the second breakpoint
gene2_direction: PRIME - The direction/prime of the gene
gene_product_type: GENE_PRODUCT_TYPE - Describes if the putative fusion product will be sense or anti-sense
genes_encompassed: Applies to intrachromosomal events only. List of genes which overlap any region that occurs between both breakpoints. For example in a deletion event these would be deleted genes.
genes_overlapping_break1: list of genes which overlap the first breakpoint
genes_overlapping_break2: list of genes which overlap the second breakpoint
genes_proximal_to_break1: list of genes near the first breakpoint and their distance from the breakpoint
genes_proximal_to_break2: list of genes near the second breakpoint and their distance from the breakpoint
library: Identifier for the library/source
linking_split_reads: int - Number of split reads that align to both breakpoints
opposing_strands: bool - Specifies if breakpoints are on opposite strands wrt the reference. Expects a boolean
pairing: A semi-colon delimited list of event identifiers i.e. <annotation_id>_<splicing pattern>_<cds start>_<cds end>
product_id: Unique identifier of the final fusion including splicing and ORF decision from the annotation step
protocol: PROTOCOL - Specifies the type of library
raw_break1_split_reads: int - Number of split reads before calling the breakpoint
raw_break2_split_reads: int - Number of split reads before calling the breakpoint
raw_flanking_pairs: int - Number of flanking reads before calling the breakpoint. The count here is based on the number of unique query names
raw_spanning_reads: int - Number of spanning reads collected during evidence collection before calling the breakpoint
spanning_read_names: read query names of the spanning reads which support the current event
spanning_reads: int - the number of spanning reads which support the event
stranded: bool - Specifies if the sequencing protocol was strand specific or not. Expects a boolean
tools: The tools that called the event originally from the cluster step. Should be a semi-colon delimited list of <tool name>_<tool version>
transcript1: Transcript for the current annotation at the first breakpoint
transcript2: Transcript for the current annotation at the second breakpoint
untemplated_seq: str - The untemplated/novel sequence between the breakpoints
validation_id: Identifier for the validation step
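The practical coverage columns above are defined as a read count divided by the evidence window length. A minimal sketch of that arithmetic follows. The function name, the dictionary layout, and the representation of the window as a 1-based inclusive `(start, end)` pair are all assumptions made for illustration; the on-disk format of the ewindow columns is not specified here.

```python
def practical_coverage(read_count, window_start, window_end):
    """Reads processed per base of the evidence window.

    Assumes (hypothetically) a 1-based inclusive window, so the window
    length is end - start + 1.
    """
    window_length = window_end - window_start + 1
    if window_length <= 0:
        raise ValueError('window end must not precede window start')
    return read_count / window_length


# hypothetical row using the column names documented above
row = {'break1_ewindow_count': 150, 'break1_ewindow': (1001, 1500)}
coverage = practical_coverage(row['break1_ewindow_count'], *row['break1_ewindow'])
```

As the column description notes, this is not the true sequencing depth: reads are sampled in bins and capped by a read limit, so the value is best read as a relative measure.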

mavis.constants.DISEASE_STATUS = <vocab.vocab.Vocab object>¶

Vocab – holds controlled vocabulary for allowed disease status

DISEASED: diseased
NORMAL: normal

mavis.constants.GENE_PRODUCT_TYPE = <vocab.vocab.Vocab object>¶

Vocab – controlled vocabulary for gene products

SENSE: the gene product is a sense fusion
ANTI_SENSE: the gene product is anti-sense

mavis.constants.GIESMA_STAIN = <vocab.vocab.Vocab object>¶: Vocab – holds controlled vocabulary relating to stains of chromosome bands

mavis.constants.NA_MAPPING_QUALITY = 255¶: int – mapping quality value to indicate mapping was not performed/calculated

mavis.constants.ORIENT = <vocab.vocab.Vocab object>¶

Vocab – holds controlled vocabulary for allowed orientation values

LEFT: left wrt the positive/forward strand
RIGHT: right wrt the positive/forward strand
NS: orientation is not specified

mavis.constants.PRIME = <vocab.vocab.Vocab object>¶

Vocab – holds controlled vocabulary for allowed gene direction (prime) values

FIVE: five prime
THREE: three prime

mavis.constants.PROTOCOL = <vocab.vocab.Vocab object>¶

Vocab – holds controlled vocabulary for allowed protocol values

GENOME: genome
TRANS: transcriptome
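To show what a controlled vocabulary buys in practice, here is a rough sketch using the standard library `enum` module in place of the Vocab class. The `Protocol` class and `enforce_protocol` helper are hypothetical, and the string values `'genome'` and `'transcriptome'` are assumed from the descriptions above rather than taken from the package.

```python
from enum import Enum


class Protocol(Enum):
    """Hypothetical stand-in for the PROTOCOL vocabulary."""
    GENOME = 'genome'
    TRANS = 'transcriptome'


def enforce_protocol(value):
    """Return the matching member, or raise if the value is not in the vocabulary."""
    try:
        return Protocol(value)
    except ValueError:
        allowed = ', '.join(member.value for member in Protocol)
        raise ValueError(
            f'{value!r} is not an allowed protocol (expected one of: {allowed})'
        ) from None


protocol = enforce_protocol('transcriptome')
```

Rejecting unknown values at the point of input keeps typos such as `'transcriptom'` from propagating silently through later pipeline steps.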

mavis.constants.PYSAM_READ_FLAGS = <vocab.vocab.Vocab object>¶

Vocab – Enum-like. For readable PYSAM flag constants

MULTIMAP: template having multiple segments in sequencing
UNMAPPED: segment unmapped
MATE_UNMAPPED: next segment in the template unmapped
REVERSE: SEQ being reverse complemented
MATE_REVERSE: SEQ of the next segment in the template being reverse complemented
FIRST_IN_PAIR: the first segment in the template
LAST_IN_PAIR: the last segment in the template
SECONDARY: secondary alignment
SUPPLEMENTARY: supplementary alignment

note: descriptions are taken from the samfile documentation
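These flags are bit fields that are combined into a single integer per read. The sketch below decodes such an integer into flag names. `ReadFlag` is a hypothetical stand-in for the Vocab object; the bit values are those of the SAM specification, and it is an assumption that the package maps each name to the same value.

```python
from enum import IntFlag


class ReadFlag(IntFlag):
    """Hypothetical stand-in for PYSAM_READ_FLAGS (bit values per the SAM specification)."""
    MULTIMAP = 0x1
    UNMAPPED = 0x4
    MATE_UNMAPPED = 0x8
    REVERSE = 0x10
    MATE_REVERSE = 0x20
    FIRST_IN_PAIR = 0x40
    LAST_IN_PAIR = 0x80
    SECONDARY = 0x100
    SUPPLEMENTARY = 0x800


def decode_flag(flag):
    """List the names of the documented flags that are set in a SAM flag integer."""
    return [member.name for member in ReadFlag if flag & member]


def is_set(flag, member):
    """Test a single bit, e.g. is_set(read_flag, ReadFlag.REVERSE)."""
    return bool(flag & member)


names = decode_flag(2064)  # 2048 (supplementary) + 16 (reverse)
```

Bits that are not part of the vocabulary listed above, such as 0x2, are simply ignored by `decode_flag`.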

mavis.constants.SPLICE_SITE_RADIUS = 2¶: int – number of bases away from an exon boundary considered to be part of the splice site such that if it were altered the splice site would be considered to be abrogated.
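One way to read this definition is as a distance test against each exon boundary. The sketch below is only an interpretation: the function name, the use of genomic coordinates for the exon start and end, and the choice to count the boundary base itself as distance zero are assumptions, not the package's documented behaviour.

```python
SPLICE_SITE_RADIUS = 2


def overlaps_splice_site(position, exon_start, exon_end, radius=SPLICE_SITE_RADIUS):
    """True if the position lies within `radius` bases of either exon boundary.

    Hypothetical helper: treats the boundary base itself as distance zero.
    """
    return any(abs(position - boundary) <= radius for boundary in (exon_start, exon_end))


near_start = overlaps_splice_site(102, exon_start=100, exon_end=200)
mid_exon = overlaps_splice_site(150, exon_start=100, exon_end=200)
```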

mavis.constants.SPLICE_TYPE = <vocab.vocab.Vocab object>¶

Vocab – holds controlled vocabulary for allowed splice type classification values

RETAIN: an intron was retained
SKIP: an exon was skipped
NORMAL: no exons were skipped and no introns were retained; the normal/expected splicing pattern was followed
MULTI_RETAIN: multiple introns were retained
MULTI_SKIP: multiple exons were skipped
COMPLEX: some combination of exon skipping and intron retention
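The six categories above follow from two counts: how many exons were skipped and how many introns were retained. A minimal sketch of that decision logic is shown here. The function name and its count-based inputs are assumptions for illustration, and the returned strings are the vocabulary names rather than whatever values the Vocab object stores.

```python
def classify_splicing(skipped_exons, retained_introns):
    """Map counts of skipped exons and retained introns onto a SPLICE_TYPE name."""
    if skipped_exons < 0 or retained_introns < 0:
        raise ValueError('counts must be non-negative')
    if skipped_exons and retained_introns:
        return 'COMPLEX'
    if skipped_exons:
        return 'SKIP' if skipped_exons == 1 else 'MULTI_SKIP'
    if retained_introns:
        return 'RETAIN' if retained_introns == 1 else 'MULTI_RETAIN'
    return 'NORMAL'


pattern = classify_splicing(skipped_exons=2, retained_introns=0)
```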

mavis.constants.START_AA = 'M'¶: str – The amino acid expected to start translation

mavis.constants.STOP_AA = '*'¶: str – The symbol representing a stop codon, expected to end translation

mavis.constants.STRAND = <vocab.vocab.Vocab object>¶

Vocab – holds controlled vocabulary for allowed strand values

POS: the positive/forward strand
NEG: the negative/reverse strand
NS: strand is not specified

mavis.constants.SVTYPE = <vocab.vocab.Vocab object>¶

Vocab – holds controlled vocabulary for acceptable structural variant classifications

DEL: deletion
TRANS: translocation
ITRANS: inverted translocation
INV: inversion
INS: insertion
DUP: duplication

mavis.constants.reverse_complement(s)[source]¶

wrapper for the Bio.Seq reverse_complement method

Parameters:	s (str) – the input DNA sequence
Returns:	the reverse complement of the input sequence
Return type:	str

Warning

assumes the input is a DNA sequence

Example

>>> reverse_complement('ATCCGGT')
'ACCGGAT'
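For readers without Biopython at hand, the same operation can be approximated with the standard library. This is a dependency-free sketch, not the wrapper itself: the name `reverse_complement_sketch` is hypothetical, and unlike Bio.Seq it only complements the four unambiguous bases, leaving any other character unchanged.

```python
# translation table for the four unambiguous DNA bases, in both cases
_COMPLEMENT = str.maketrans('ACGTacgt', 'TGCAtgca')


def reverse_complement_sketch(s):
    """Complement each base, then reverse the sequence."""
    return s.translate(_COMPLEMENT)[::-1]


result = reverse_complement_sketch('ATCCGGT')
```

Applying the function twice returns the original sequence, which makes a convenient sanity check.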

mavis.constants.sort_columns(input_columns)[source]¶

mavis.constants.translate(s, reading_frame=0)[source]¶

given a DNA sequence, translates it and returns the protein amino acid sequence

Parameters:	s (str) – the input DNA sequence
	reading_frame (int) – where to start translating the sequence
Returns:	the amino acid sequence
Return type:	str
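The role of `reading_frame` is easiest to see in a small worked example. The sketch below is a standalone approximation using the standard genetic code, not the package's implementation. The name `translate_sketch` is hypothetical, and two behaviours are assumptions: trailing bases that do not fill a whole codon are dropped, and codons containing anything other than A, C, G or T raise a `KeyError`.

```python
from itertools import product

CODON_SIZE = 3
START_AA = 'M'
STOP_AA = '*'

# standard genetic code, with codons enumerated in T, C, A, G order
_BASES = 'TCAG'
_AMINO_ACIDS = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG'
CODON_TABLE = {
    ''.join(codon): amino_acid
    for codon, amino_acid in zip(product(_BASES, repeat=CODON_SIZE), _AMINO_ACIDS)
}


def translate_sketch(s, reading_frame=0):
    """Translate a DNA sequence, starting `reading_frame` bases into it.

    Assumption: an incomplete codon at the end of the sequence is ignored.
    """
    s = s.upper()
    last_codon_start = len(s) - CODON_SIZE + 1
    return ''.join(
        CODON_TABLE[s[i:i + CODON_SIZE]]
        for i in range(reading_frame, last_codon_start, CODON_SIZE)
    )


frame0 = translate_sketch('ATGGCCTAA')                   # codons ATG, GCC, TAA
frame1 = translate_sketch('ATGGCCTAA', reading_frame=1)  # codons TGG, CCT
```

In frame 0 the result begins with `START_AA` and ends with `STOP_AA`; shifting the frame by one base yields an entirely different peptide, which is why the frame has to be tracked explicitly.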