protein module

class mavis.annotate.protein.Domain(name, regions, translation=None, data=None)[source]

Bases: object

Parameters:
  • name (str) – the name of the domain, e.g. PF00876
  • regions (list of DomainRegion) – the amino acid ranges that are part of the domain
  • translation (Translation) – the ‘parent’ translation this domain belongs to
Raises:

AttributeError – if the end of any region is less than the start

Example

>>> Domain('DNA binding domain', [(1, 4), (10, 24)], translation)
align_seq(input_sequence, reference_genome=None, min_region_match=0.5)[source]

aligns each region to the input sequence, starting with the last region. The part of the input sequence that remains is then used to align the second-last region, and so on. Returns a list of intervals for the alignment. Raises an error if multiple alignments are found

Parameters:
  • input_sequence (str) – the sequence to be aligned to
  • reference_genome (dict of Bio.SeqRecord by str) – dict of reference sequence by template/chr name
  • min_region_match (float) – fraction between 0 and 1. Each region must have a score of at least len(seq) * min_region_match
Returns:

tuple contains

  • int: the number of matches
  • int: the total number of amino acids to be aligned
  • list of DomainRegion: the list of domain regions on the new input sequence

Return type:

tuple

Raises:
  • AttributeError – if sequence information is not available
  • UserWarning – if a valid alignment could not be found or no best alignment was found
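The right-to-left strategy described for align_seq can be sketched roughly as follows. This is a simplified illustration and not the MAVIS implementation: the function name `align_regions_right_to_left`, the ungapped position-by-position scoring, and the 1-based inclusive `(start, end)` tuples used in place of DomainRegion objects are all assumptions.

```python
def align_regions_right_to_left(region_seqs, input_sequence, min_region_match=0.5):
    """Hypothetical sketch: place each region on input_sequence, last region first.

    region_seqs: amino acid strings ordered by region start.
    Returns (matches, total, regions), regions as 1-based inclusive tuples.
    """
    limit = len(input_sequence)  # exclusive right bound for the next region
    total = sum(len(seq) for seq in region_seqs)
    matches = 0
    placed = []
    for seq in reversed(region_seqs):
        min_score = len(seq) * min_region_match
        best_score, best_starts = -1, []
        # ungapped scan of every window that still fits left of `limit`
        for start in range(0, limit - len(seq) + 1):
            window = input_sequence[start:start + len(seq)]
            score = sum(1 for a, b in zip(seq, window) if a == b)
            if score > best_score:
                best_score, best_starts = score, [start]
            elif score == best_score:
                best_starts.append(start)
        if best_score < min_score:
            raise UserWarning('no valid alignment for region', seq)
        if len(best_starts) > 1:
            raise UserWarning('multiple best alignments for region', seq)
        start = best_starts[0]
        placed.append((start + 1, start + len(seq)))
        matches += best_score
        limit = start  # earlier regions must lie to the left of this one
    return matches, total, placed[::-1]
```

Shrinking `limit` after each placement is what keeps the regions in their original order on the new sequence.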
get_seqs(reference_genome=None, ignore_cache=False)[source]

returns the amino acid sequences for each of the domain regions associated with this domain in the order of the regions (sorted by start)

Parameters:reference_genome (dict of Bio.SeqRecord by str) – dict of reference sequence by template/chr name
Returns:list of amino acid sequences for each DomainRegion
Return type:list of str
Raises:AttributeError – if there is not enough sequence information given to determine this
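As a rough illustration of what get_seqs returns, the sketch below slices an amino acid sequence by region. The helper name `region_seqs` and the assumption that regions are 1-based inclusive `(start, end)` pairs are illustrative and not taken from the MAVIS source.

```python
def region_seqs(aa_seq, regions):
    """Hypothetical helper: amino acid sequence of each region, sorted by start."""
    seqs = []
    for start, end in sorted(regions):
        if end < start:
            raise AttributeError('region end is less than start', start, end)
        # convert 1-based inclusive coordinates to a Python slice
        seqs.append(aa_seq[start - 1:end])
    return seqs
```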
key()[source]

tuple: a tuple of the items expected to be unique, used for hashing and comparing

score_region_mapping(reference_genome=None)[source]

compares the sequence in each DomainRegion to the sequence collected for that domain region from the translation object

Parameters:reference_genome (dict of Bio.SeqRecord by str) – dict of reference sequence by template/chr name
Returns:tuple contains
  • int: the number of matching amino acids
  • int: the total number of amino acids
Return type:tuple of int and int
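The comparison performed by score_region_mapping might look something like the sketch below, which counts matching residues between the sequence stored on each region and the sequence taken from the translation. The function name `score_regions` and the requirement that paired sequences have equal length are assumptions made for this example.

```python
def score_regions(stored_seqs, translated_seqs):
    """Hypothetical sketch: return (matching amino acids, total amino acids)."""
    matches = 0
    total = 0
    for stored, translated in zip(stored_seqs, translated_seqs):
        if len(stored) != len(translated):
            raise AssertionError('region sequences differ in length')
        total += len(stored)
        # count positions where the two sequences agree
        matches += sum(1 for a, b in zip(stored, translated) if a == b)
    return matches, total
```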
translation

Translation – the Translation this domain belongs to

class mavis.annotate.protein.DomainRegion(start, end, seq=None, domain=None, name=None)[source]

Bases: mavis.annotate.base.BioInterval

class mavis.annotate.protein.Translation(start, end, transcript=None, domains=None, seq=None, name=None)[source]

Bases: mavis.annotate.base.BioInterval

describes the splicing pattern and cds start and end with reference to a particular transcript

Parameters:
  • start (int) – start of the coding sequence (cds) relative to the start of the first exon in the transcript
  • end (int) – end of the coding sequence (cds) relative to the start of the first exon in the transcript
  • transcript (Transcript) – the transcript this is a Translation of
  • domains (list of Domain) – a list of the domains on this translation
  • seq (str) – the cds sequence
convert_aa_to_cdna(pos)[source]
Parameters:pos (int) – the amino acid position
Returns:the cdna equivalent (with CODON_SIZE uncertainty)
Return type:Interval
convert_cdna_to_aa(pos)[source]
Parameters:pos (int) – the cdna position
Returns:the protein/amino-acid position
Return type:int
Raises:AttributeError – the cdna position is not translated
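The arithmetic behind convert_aa_to_cdna and convert_cdna_to_aa can be sketched as below. These standalone functions are illustrative only: `cds_start` and `cds_end` (the cdna positions of the first and last coding bases) are passed explicitly here, and a plain tuple stands in for the Interval return type.

```python
CODON_SIZE = 3

def aa_to_cdna(pos, cds_start):
    """Hypothetical sketch: cdna positions (first, last) of the codon for amino acid `pos`."""
    first = cds_start + (pos - 1) * CODON_SIZE
    return first, first + CODON_SIZE - 1

def cdna_to_aa(pos, cds_start, cds_end):
    """Hypothetical sketch: 1-based amino acid position for a cdna position."""
    if pos < cds_start or pos > cds_end:
        raise AttributeError('the cdna position is not translated', pos)
    return (pos - cds_start) // CODON_SIZE + 1
```

Any of the three cdna positions of a codon maps back to the same amino acid, which is why the forward conversion yields a range instead of a single position.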
convert_genomic_to_cds(pos)[source]

converts a genomic position to its cds (coding sequence) equivalent

Parameters:pos (int) – the genomic position
Returns:the cds position (negative if before the initiation start site)
Return type:int
convert_genomic_to_cds_notation(pos)[source]

converts a genomic position to its cds (coding sequence) equivalent using hgvs cds notation

Parameters:pos (int) – the genomic position
Returns:the cds position notation
Return type:str

Example

>>> tl = Translation(...)
# a position before the translation start
>>> tl.convert_genomic_to_cds_notation(1010)
'-50'
# a position after the translation end
>>> tl.convert_genomic_to_cds_notation(2031)
'*72'
# an intronic position
>>> tl.convert_genomic_to_cds_notation(1542)
'50+10'
>>> tl.convert_genomic_to_cds_notation(1589)
'51-14'
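The strings in the example above follow a simple pattern that can be sketched as a formatter. The helper `format_cds_notation` is hypothetical, as is the assumption that positions past the end of the cds arrive as values larger than the cds length.

```python
def format_cds_notation(cds, shift, cds_length):
    """Hypothetical sketch: format a cds position and intronic shift as a string."""
    if cds == 0:
        raise ValueError('this numbering has no position zero')
    if cds > cds_length:
        # positions after the translation end are counted from the last cds base
        base = '*{}'.format(cds - cds_length)
    else:
        base = str(cds)  # negative values already carry their '-' sign
    suffix = '{:+d}'.format(shift) if shift else ''
    return base + suffix
```

With an assumed cds length of 900, `format_cds_notation(972, 0, 900)` gives `'*72'` and `format_cds_notation(51, -14, 900)` gives `'51-14'`.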
convert_genomic_to_nearest_cds(pos)[source]

converts a genomic position to its cds equivalent or (if intronic) the nearest cds and shift

Parameters:pos (int) – the genomic position
Returns:
  • int - the cds position
  • int - the intronic shift
Return type:tuple of int and int
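One way to picture the nearest-cds conversion is the positive-strand sketch below. It is not the MAVIS implementation: the function name, the explicit `exons` list of 1-based inclusive genomic `(start, end)` pairs, the tie-break toward the upstream exon, and the skipping of position zero before the start codon are all assumptions.

```python
def genomic_to_nearest_cds(pos, exons, cds_start):
    """Hypothetical sketch (positive strand only): return (cds position, intronic shift).

    exons: sorted (start, end) genomic intervals; cds_start: cdna position of the first coding base.
    """
    if pos < exons[0][0] or pos > exons[-1][1]:
        raise IndexError('position is outside the transcript', pos)
    cdna_offset = 0  # exonic bases in the exons before the current one
    for i, (start, end) in enumerate(exons):
        if pos < start:
            # intronic: pick the closer of the two flanking exon boundaries
            prev_end = exons[i - 1][1]
            if pos - prev_end <= start - pos:
                cdna, shift = cdna_offset, pos - prev_end
            else:
                cdna, shift = cdna_offset + 1, pos - start
            break
        if pos <= end:
            cdna, shift = cdna_offset + pos - start + 1, 0
            break
        cdna_offset += end - start + 1
    cds = cdna - cds_start + 1
    if cds <= 0:
        cds -= 1  # assumed convention: the base before position 1 is -1
    return cds, shift
```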
get_aa_seq(reference_genome=None, ignore_cache=False)[source]
Parameters:reference_genome (dict of Bio.SeqRecord by str) – dict of reference sequence by template/chr name
Returns:the amino acid sequence
Return type:str
Raises:AttributeError – if the reference sequence has not been given and is not set
get_cds_seq(reference_genome=None, ignore_cache=False)[source]
Parameters:reference_genome (dict of Bio.SeqRecord by str) – dict of reference sequence by template/chr name
Returns:the cds sequence
Return type:str
Raises:AttributeError – if the reference sequence has not been given and is not set
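The relationship between the cds sequence and the amino acid sequence can be illustrated with a small standalone sketch. The helper names `cds_seq` and `aa_seq`, the 1-based inclusive `start`/`end` on the spliced cdna, the standard genetic code, and the use of `'X'` for unrecognized codons are assumptions for this example. MAVIS itself works from the reference genome and the transcript.

```python
from itertools import product

# standard genetic code, codons enumerated with bases in T, C, A, G order
BASES = 'TCAG'
AMINO_ACIDS = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG'
CODON_TABLE = {
    ''.join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def cds_seq(spliced_cdna, start, end):
    """Hypothetical sketch: the coding part of a spliced cdna sequence."""
    return spliced_cdna[start - 1:end]

def aa_seq(spliced_cdna, start, end):
    """Hypothetical sketch: translate the cds codon by codon ('*' marks a stop)."""
    cds = cds_seq(spliced_cdna, start, end).upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    return ''.join(CODON_TABLE.get(cds[i:i + 3], 'X') for i in range(0, usable, 3))
```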
get_seq(reference_genome=None, ignore_cache=False)[source]

wrapper for the sequence method

Parameters:reference_genome (dict of Bio.SeqRecord by str) – dict of reference sequence by template/chr name
key()[source]

see mavis.annotate.base.BioInterval.key()

transcript

Transcript – the spliced transcript this translation belongs to

mavis.annotate.protein.calculate_orf(spliced_cdna_sequence, min_orf_size=None)[source]

calculate all possible open reading frames given a spliced cdna sequence (no introns)

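Parameters:
  • spliced_cdna_sequence (str) – the spliced cdna sequence
  • min_orf_size (int) – the minimum size an open reading frame must have to be reported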
Returns:list of open reading frame positions on the input sequence
Return type:list of Interval
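A minimal sketch of an open reading frame search is shown below to make the return value concrete. It is an illustration under stated assumptions and not the MAVIS algorithm: the name `find_orfs`, the choice to start at every ATG and stop at the first in-frame stop codon, the 1-based inclusive tuples in place of Interval objects, and measuring `min_orf_size` in nucleotides are all assumed.

```python
STOP_CODONS = {'TAA', 'TAG', 'TGA'}

def find_orfs(spliced_cdna_sequence, min_orf_size=None):
    """Hypothetical sketch: (start, end) of every ATG-to-stop frame, stop codon included."""
    seq = spliced_cdna_sequence.upper()
    orfs = []
    for start in range(len(seq) - 2):
        if seq[start:start + 3] != 'ATG':
            continue
        # walk codon by codon in the frame set by this start codon
        for pos in range(start + 3, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                size = pos + 3 - start
                if min_orf_size is None or size >= min_orf_size:
                    orfs.append((start + 1, pos + 3))
                break
    return orfs
```

A start codon with no in-frame stop before the end of the sequence produces no entry in this sketch.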