protein module

class mavis.annotate.protein.Domain(name, regions, translation=None, data=None)

Bases: object

Parameters:

name (str) – the name of the domain, e.g. PF00876
regions (`list` of `DomainRegion`) – the amino acid ranges that are part of the domain
translation (Translation) – the ‘parent’ translation this domain belongs to

Raises:	`AttributeError` – if the end of any region is less than the start

Example

>>> Domain('DNA binding domain', [(1, 4), (10, 24)], translation)

align_seq(input_sequence, reference_genome=None, min_region_match=0.5)

Aligns each region to the input sequence, starting with the last region. The portion of the sequence that remains is then used to align the second-last region, and so on. Returns a list of intervals for the alignment. Raises an error if multiple alignments are found.

Parameters:

input_sequence (str) – the sequence to be aligned to
reference_genome (`dict` of `Bio.SeqRecord` by `str`) – dict of reference sequence by template/chr name
min_region_match (float) – fraction between 0 and 1; each region must have a score of at least len(seq) * min_region_match

Returns:	a tuple of the number of matches (int), the total number of amino acids to be aligned (int), and the list of domain regions on the new input sequence (`list` of `DomainRegion`)
Return type:	tuple
Raises:

`AttributeError` – if sequence information is not available
`UserWarning` – if a valid alignment could not be found or no best alignment was found
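
The last-to-first strategy described for align_seq can be illustrated with a small standalone sketch. This is not the MAVIS implementation: the helper names `best_ungapped_match` and `align_regions` are hypothetical, scoring is a plain count of identical residues in an ungapped window, and "the sequence that remains" is assumed to mean the part upstream of the region just placed.

```python
def best_ungapped_match(region_seq, target, min_region_match=0.5):
    """Return (score, start, end) for the unique best ungapped placement.

    Coordinates are 1-based and inclusive. Hypothetical helper, not MAVIS code.
    """
    best_score, best_offsets = 0, []
    for offset in range(len(target) - len(region_seq) + 1):
        window = target[offset:offset + len(region_seq)]
        score = sum(1 for a, b in zip(region_seq, window) if a == b)
        if score > best_score:
            best_score, best_offsets = score, [offset]
        elif score == best_score:
            best_offsets.append(offset)
    if not best_offsets or best_score < len(region_seq) * min_region_match:
        raise UserWarning('no valid alignment found')
    if len(best_offsets) > 1:
        raise UserWarning('multiple best alignments found')
    start = best_offsets[0] + 1
    return best_score, start, start + len(region_seq) - 1


def align_regions(region_seqs, input_sequence, min_region_match=0.5):
    """Place regions from last to first, shrinking the searchable prefix each time."""
    remaining = input_sequence
    placed, matches = [], 0
    for seq in reversed(region_seqs):
        score, start, end = best_ungapped_match(seq, remaining, min_region_match)
        matches += score
        placed.append((start, end))
        # assumption: earlier regions must lie upstream of the one just placed
        remaining = remaining[:start - 1]
    total = sum(len(seq) for seq in region_seqs)
    return matches, total, placed[::-1]
```

Calling `align_regions(['MKT', 'GGH'], 'AAMKTAAAGGHAA')` places the second region first and then searches only the prefix before it for the first region.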

get_seqs(reference_genome=None, ignore_cache=False)

returns the amino acid sequences for each of the domain regions associated with this domain in the order of the regions (sorted by start)

Parameters:	reference_genome (`dict` of `Bio.SeqRecord` by `str`) – dict of reference sequence by template/chr name
Returns:	list of amino acid sequences for each DomainRegion
Return type:	`list` of `str`
Raises:	`AttributeError` – if there is not enough sequence information given to determine this
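
As a rough illustration of the behaviour described for get_seqs, the sketch below slices an amino acid string by region. It assumes 1-based inclusive region coordinates and a plain string in place of the translation; `get_region_seqs` is a hypothetical stand-in, not the library method.

```python
def get_region_seqs(regions, aa_seq=None):
    """Return one amino acid string per region, ordered by region start.

    regions: iterable of (start, end) pairs, 1-based and inclusive (assumed).
    aa_seq: the translated sequence; None models missing sequence information.
    """
    if aa_seq is None:
        raise AttributeError('insufficient sequence information')
    return [aa_seq[start - 1:end] for start, end in sorted(regions)]
```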

key()

Returns:	a tuple representing the items expected to be unique, used for hashing and comparing
Return type:	tuple

score_region_mapping(reference_genome=None)

compares the sequence in each DomainRegion to the sequence collected for that domain region from the translation object

Parameters:	reference_genome (`dict` of `Bio.SeqRecord` by `str`) – dict of reference sequence by template/chr name
Returns:	a tuple of the number of matching amino acids (int) and the total number of amino acids (int)
Return type:	tuple of int and int
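
The comparison described for score_region_mapping amounts to counting position-wise identities between the sequence stored on each region and the sequence observed at the same coordinates. A minimal sketch, assuming 1-based inclusive coordinates and using plain tuples in place of `DomainRegion` objects (the function name `score_regions` is hypothetical):

```python
def score_regions(regions, aa_seq):
    """Count matching residues between stored and observed region sequences.

    regions: iterable of (start, end, stored_seq) tuples (hypothetical layout).
    aa_seq: the amino acid sequence of the translation.
    Returns (matches, total).
    """
    matches = total = 0
    for start, end, stored in sorted(regions):
        observed = aa_seq[start - 1:end]
        matches += sum(1 for a, b in zip(stored, observed) if a == b)
        total += len(stored)
    return matches, total
```

For example, a region stored as `'IAKE'` at positions 6 to 9 of `'MKTAYIAKQR'` contributes 3 matches out of 4.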

translation: Translation – the Translation this domain belongs to

class mavis.annotate.protein.DomainRegion(start, end, seq=None, domain=None, name=None)

Bases: mavis.annotate.base.BioInterval

class mavis.annotate.protein.Translation(start, end, transcript=None, domains=None, seq=None, name=None)

Bases: mavis.annotate.base.BioInterval

describes the splicing pattern and cds start and end with reference to a particular transcript

Parameters:

start (int) – start of the coding sequence (cds) relative to the start of the first exon in the transcript
end (int) – end of the coding sequence (cds) relative to the start of the first exon in the transcript
transcript (Transcript) – the transcript this is a Translation of
domains (list of Domain) – a list of the domains on this translation
seq (str) – the cds sequence

convert_aa_to_cdna(pos)

Parameters:	pos (int) – the amino acid position
Returns:	the cdna equivalent (with CODON_SIZE uncertainty)
Return type:	Interval
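
The "CODON_SIZE uncertainty" arises because one amino acid corresponds to a whole codon, so the result is a three-base interval and not a single position. A sketch of the arithmetic, assuming 1-based inclusive coordinates and a `cds_start` argument giving the cdna coordinate of the first coding base (the function `aa_to_cdna` and its tuple return are illustrative, not the library API):

```python
CODON_SIZE = 3


def aa_to_cdna(pos, cds_start=1):
    """Return the (first, last) cdna positions of the codon for amino acid `pos`."""
    if pos < 1:
        raise ValueError('amino acid positions are 1-based')
    first = cds_start + (pos - 1) * CODON_SIZE
    return first, first + CODON_SIZE - 1
```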

convert_cdna_to_aa(pos)

Parameters:	pos (int) – the cdna position
Returns:	the protein/amino-acid position
Return type:	int
Raises:	`AttributeError` – if the cdna position is not translated
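
The reverse mapping is an integer division by the codon size, with positions outside the coding range rejected. The sketch below assumes 1-based inclusive coordinates and takes the coding range as explicit `cds_start` and `cds_end` arguments; `cdna_to_aa` is a hypothetical name.

```python
CODON_SIZE = 3


def cdna_to_aa(pos, cds_start, cds_end):
    """Return the 1-based amino acid position containing cdna position `pos`."""
    if pos < cds_start or pos > cds_end:
        # mirrors the documented behaviour of raising when the position is untranslated
        raise AttributeError('cdna position is not translated', pos)
    return (pos - cds_start) // CODON_SIZE + 1
```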

convert_genomic_to_cds(pos)

converts a genomic position to its cds (coding sequence) equivalent

Parameters:	pos (int) – the genomic position
Returns:	the cds position (negative if before the translation initiation site)
Return type:	int

convert_genomic_to_cds_notation(pos)

converts a genomic position to its cds (coding sequence) equivalent using hgvs cds notation

Parameters:	pos (int) – the genomic position
Returns:	the cds position notation
Return type:	str

Example

>>> tl = Translation(...)
# a position before the translation start
>>> tl.convert_genomic_to_cds_notation(1010)
'-50'
# a position after the translation end
>>> tl.convert_genomic_to_cds_notation(2031)
'*72'
# an intronic position
>>> tl.convert_genomic_to_cds_notation(1542)
'50+10'
>>> tl.convert_genomic_to_cds_notation(1589)
'51-14'
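
The four notations in the example above can be reproduced from a signed cds position, an intronic shift, and the length of the coding sequence. The following is a sketch of that formatting step only, under the assumption that positions past the end of the coding sequence are numbered from the last coding base; `format_cds_notation` and the length of 300 used below are hypothetical.

```python
def format_cds_notation(cds_pos, shift, cds_length):
    """Render a cds position and intronic shift as an HGVS-style string.

    cds_pos: signed cds position (negative before the translation start).
    shift: intronic offset, 0 when the position is exonic.
    cds_length: number of bases in the coding sequence (assumed known).
    """
    suffix = '{:+d}'.format(shift) if shift else ''
    if cds_pos > cds_length:
        # positions after the translation end are written with a '*' prefix
        return '*{}{}'.format(cds_pos - cds_length, suffix)
    return '{}{}'.format(cds_pos, suffix)
```

With a coding sequence of 300 bases, `format_cds_notation(372, 0, 300)` gives `'*72'` and `format_cds_notation(51, -14, 300)` gives `'51-14'`.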

convert_genomic_to_nearest_cds(pos)

converts a genomic position to its cds equivalent or (if intronic) the nearest cds and shift

Parameters:	pos (int) – the genomic position
Returns:	a tuple of the cds position (int) and the intronic shift (int)
Return type:	tuple of int and int
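
One way to picture the "nearest cds and shift" result is with a toy forward-strand transcript. The sketch below is not the MAVIS code: the exon layout, the tie-break toward the upstream exon, and the skipping of position 0 (which HGVS numbering does not use) are all assumptions, and `genomic_to_nearest_cds` is a hypothetical free function.

```python
def genomic_to_nearest_cds(pos, exons, cds_start):
    """Return (cds position, intronic shift) for a genomic position.

    exons: sorted (start, end) genomic intervals, 1-based inclusive, forward strand.
    cds_start: cdna coordinate of the first coding base.
    """
    if pos < exons[0][0] or pos > exons[-1][1]:
        raise ValueError('position is outside the transcript')
    cdna, shift, consumed = None, 0, 0
    for i, (start, end) in enumerate(exons):
        if start <= pos <= end:
            cdna = consumed + pos - start + 1
            break
        consumed += end - start + 1  # cdna coordinate of this exon's last base
        next_start = exons[i + 1][0]
        if pos < next_start:  # intronic: snap to the closer exon boundary
            if pos - end <= next_start - pos:
                cdna, shift = consumed, pos - end
            else:
                cdna, shift = consumed + 1, pos - next_start
            break
    cds = cdna - cds_start + 1
    if cds <= 0:
        cds -= 1  # assumption: there is no cds position 0
    return cds, shift
```

With exons `[(1000, 1099), (1200, 1299)]` and `cds_start=51`, genomic position 1110 lies 11 bases into the intron and maps to `(50, 11)`, while 1190 is closer to the second exon and maps to `(51, -10)`.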

get_aa_seq(reference_genome=None, ignore_cache=False)

Parameters:	reference_genome (`dict` of `Bio.SeqRecord` by `str`) – dict of reference sequence by template/chr name
Returns:	the amino acid sequence
Return type:	str
Raises:	`AttributeError` – if the reference sequence has not been given and is not set

get_cds_seq(reference_genome=None, ignore_cache=False)

Parameters:	reference_genome (`dict` of `Bio.SeqRecord` by `str`) – dict of reference sequence by template/chr name
Returns:	the cds sequence
Return type:	str
Raises:	`AttributeError` – if the reference sequence has not been given and is not set

get_seq(reference_genome=None, ignore_cache=False)

wrapper for the sequence method

Parameters:	reference_genome (`dict` of `Bio.SeqRecord` by `str`) – dict of reference sequence by template/chr name

key()

see mavis.annotate.base.BioInterval.key()

transcript: Transcript – the spliced transcript this translation belongs to

mavis.annotate.protein.calculate_orf(spliced_cdna_sequence, min_orf_size=None)

calculate all possible open reading frames given a spliced cdna sequence (no introns)

Parameters:	spliced_cdna_sequence (str) – the sequence
Returns:	list of open reading frame positions on the input sequence
Return type:	`list` of `Interval`
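
For readers unfamiliar with open reading frame detection, the general idea can be sketched as a scan of each of the three forward frames for an ATG followed by an in-frame stop codon. This is a simplified illustration and may differ from what calculate_orf actually does: `find_orfs` is a hypothetical name, intervals are returned as 1-based inclusive tuples that include the stop codon, and `min_orf_size` is interpreted here as a length in nucleotides.

```python
CODON_SIZE = 3
START_CODON = 'ATG'
STOP_CODONS = {'TAA', 'TAG', 'TGA'}


def find_orfs(seq, min_orf_size=None):
    """Return sorted (start, end) intervals running from ATG through the stop codon."""
    seq = seq.upper()
    orfs = []
    for frame in range(CODON_SIZE):
        orf_start = None
        for i in range(frame, len(seq) - CODON_SIZE + 1, CODON_SIZE):
            codon = seq[i:i + CODON_SIZE]
            if codon == START_CODON and orf_start is None:
                orf_start = i + 1  # first ATG since the previous stop opens the ORF
            elif codon in STOP_CODONS and orf_start is not None:
                end = i + CODON_SIZE
                if min_orf_size is None or end - orf_start + 1 >= min_orf_size:
                    orfs.append((orf_start, end))
                orf_start = None
    return sorted(orfs)
```

Only the first ATG after a stop opens a frame in this sketch, so nested start codons within the same frame are not reported separately.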