biofx.parser package

Submodules

biofx.parser.GSCDirectoryTreeParser module

@cchng

This module parses POG-like directory structures.

Todo

Fix all the hard-coded paths.

class biofx.parser.GSCDirectoryTreeParser.GSCDirectoryTemplate(config='/home/cchoo/git/indel_merge/venv/lib/python3.6/site-packages/biofx-0.18.3-py3.6.egg/biofx/configs/directories.conf')[source]

Bases: object

TEMPLATE_CONFIG = '/home/cchoo/git/indel_merge/venv/lib/python3.6/site-packages/biofx-0.18.3-py3.6.egg/biofx/configs/directories.conf'
get_path_template(analysis_type, directory_type)[source]

Get the path template for a specific analysis type.

Parameters:
  • analysis_type (string) – analysis type
  • directory_type (string) – directory type
Returns:template path that can be formatted with format()
Return type:p (string)
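
As a minimal sketch of how a returned template might be filled in, assume a template string with named placeholders. The placeholder names and values below are hypothetical; the real templates are defined in directories.conf and are not shown in this reference.

```python
# Hypothetical template of the kind get_path_template() might return.
# The placeholder names are assumptions, not taken from directories.conf.
template = "{root}/{project}/{id}/wgs/{library}"

# Fill in the placeholders with str.format(); all values are made up.
path = template.format(
    root="/projects", project="POG", id="POG001", library="A12345"
)
print(path)
```
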
class biofx.parser.GSCDirectoryTreeParser.GSCDirectoryTree(_id=None, project='POG', root=None, config='/home/cchoo/git/indel_merge/venv/lib/python3.6/site-packages/biofx-0.18.3-py3.6.egg/biofx/configs/tumour_characterization_project.cfg')[source]

Bases: object

GSCDirectoryTree class contains methods for retrieving items in the tree.

_id

string – Top level identifier. Defaults to None.

project

string – Project name. Defaults to POG.

root

string – Base path to standard directory. Defaults to None.

config

string – Path to project configs. Defaults to configs/tumour_characterization_project.cfg.

DEFAULT_CONFIG = '/home/cchoo/git/indel_merge/venv/lib/python3.6/site-packages/biofx-0.18.3-py3.6.egg/biofx/configs/tumour_characterization_project.cfg'
PROJECT_CONFIG = '/home/cchoo/git/indel_merge/venv/lib/python3.6/site-packages/biofx-0.18.3-py3.6.egg/biofx/configs/tumour_characterization_project.cfg'
add_expression_analysis(stranded=True)[source]

Add expression analysis.

Returns:number of expression analyses.
Return type:(int)
add_paired_analysis()[source]

Add paired analysis.

Returns:number of paired analyses.
Return type:(int)
add_wgs_bams()[source]
get_file_with_id(template, library='*', prefix='*')[source]
Parameters:
  • template (string) – a string template
  • library (string) – library
  • prefix (string) – file prefix
Returns:

a list of file paths matching template and other args provided

Return type:

files (list)
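The wildcard defaults suggest glob-style matching. The sketch below illustrates that idea under the assumption that the template uses {library} and {prefix} placeholders; find_files is a hypothetical stand-in, not the method's actual implementation.

```python
import glob
import os
import tempfile


def find_files(template, library="*", prefix="*"):
    """Expand a path template and return the sorted list of matching files."""
    pattern = template.format(library=library, prefix=prefix)
    return sorted(glob.glob(pattern))


# Build a throwaway directory tree to demonstrate the wildcard defaults.
with tempfile.TemporaryDirectory() as root:
    for lib in ("libA", "libB"):
        os.makedirs(os.path.join(root, lib))
        open(os.path.join(root, lib, "sample.bam"), "w").close()

    template = os.path.join(root, "{library}", "{prefix}.bam")
    # With the defaults, every library directory is searched.
    all_files = [os.path.relpath(f, root) for f in find_files(template)]
    # Passing a library narrows the search to that directory.
    one_file = [os.path.relpath(f, root) for f in find_files(template, library="libA")]
```
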

get_ids()[source]

Retrieves identifier as defined by subdirectory name.

get_rna_libraries()[source]

Get RNA libraries.

Returns:
Return type:(list)
get_sample_info(library)[source]

Get sample info, for example biop1 for library.

Parameters:library (string) – library ID
get_tumour_content(library)[source]

Get tumour content for library.

Parameters:library (string) – library ID
Returns:tumour content. None if paired analysis is not available, and ‘not available’ if it hasn’t been reviewed.
Return type:(string)
get_wgs_libraries(biotype=None)[source]

Get WGS libraries.

Parameters:biotype (string) – NOT implemented.
Returns:
Return type:(list)
parse_libraries(wgs=True, rna=True)[source]

Parses libraries in tree and returns a dictionary

set_id(_id)[source]

Set identifier.

Parameters:_id (string) – identifier

Raises:ValueError – invalid identifier

biofx.parser.LRGparser module

@cchng

This module is for processing LRG transcript resource files, for example ftp://ftp.ebi.ac.uk/pub/databases/lrgex/list_LRGs_transcripts_GRCh37.txt. A description of the file is available at http://www.lrg-sequence.org/downloads.

biofx.parser.LRGparser.parse_LRG(infile)[source]

Parses LRG file.

Parameters:infile – input file path

biofx.parser.SnpEffparser module

@cchng

This module is for processing SnpEff files and EFF strings.

biofx.parser.SnpEffparser.eff_has_transcript(transcript_id, eff_maps, partial=False)[source]
Parameters:
  • transcript_id (string) – ensembl transcript id
  • eff_maps (list) – output of parse_effect()
  • partial (bool) – check based on partial transcript match (in the case of refseq)
Returns:

True if transcript in list of effects

Return type:

(bool)
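A minimal sketch of the check, assuming each effect map stores its ID under the 'transcript' key (as in the parse_effect() example) and that a partial match means the annotated ID starts with the query, for instance a RefSeq accession given without its version suffix. Both the helper name and the prefix rule are assumptions.

```python
def has_transcript(transcript_id, eff_maps, partial=False):
    """Return True if any effect map is annotated on transcript_id."""
    for eff in eff_maps:
        annotated = eff.get("transcript", "")
        if annotated == transcript_id:
            return True
        # Assumed meaning of `partial`: prefix match on the annotated ID.
        if partial and annotated.startswith(transcript_id):
            return True
    return False


eff_maps = [{"transcript": "ENST00000290818"}, {"transcript": "NM_000546.5"}]
exact = has_transcript("ENST00000290818", eff_maps)
unversioned = has_transcript("NM_000546", eff_maps)
unversioned_partial = has_transcript("NM_000546", eff_maps, partial=True)
```
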

biofx.parser.SnpEffparser.filter_snpeff_impact(eff_maps, impact_filter=['HIGH', 'MODERATE'], hierarchical=False)[source]

Filter snpeff annotations by impact.

Parameters:
  • eff_maps (list) – output of parse_effect()
  • impact_filter (list) – list of impacts to be included in output. Defaults to ["HIGH", "MODERATE"].
  • hierarchical (boolean) – return highest impact only
Returns:

eff_maps format, filtered by impact

Return type:

filtered_by_impact (list)

Raises:
  • AssertionError – Only HIGH, MODERATE, LOW, MODIFIER impacts are accepted, as defined in the Snpeff manual (Section – EFF field).
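
The filtering described above can be sketched as follows. This is an illustration, not the library's code: it assumes the impact is stored under 'effect_impact' (as in the parse_effect() example) and that hierarchical mode keeps only the effects sharing the highest impact present.

```python
IMPACT_ORDER = ["HIGH", "MODERATE", "LOW", "MODIFIER"]  # highest first


def filter_by_impact(eff_maps, impact_filter=("HIGH", "MODERATE"), hierarchical=False):
    """Keep effect maps whose impact is listed in impact_filter."""
    unknown = set(impact_filter) - set(IMPACT_ORDER)
    assert not unknown, "unsupported impact(s): %s" % sorted(unknown)

    kept = [e for e in eff_maps if e["effect_impact"] in impact_filter]
    if hierarchical and kept:
        # Assumed behaviour: keep only effects with the top-ranked impact.
        best = min(IMPACT_ORDER.index(e["effect_impact"]) for e in kept)
        kept = [e for e in kept if IMPACT_ORDER.index(e["effect_impact"]) == best]
    return kept


eff_maps = [
    {"effect_impact": "MODIFIER"},
    {"effect_impact": "MODERATE"},
    {"effect_impact": "HIGH"},
]
default_kept = filter_by_impact(eff_maps)
highest_only = filter_by_impact(eff_maps, hierarchical=True)
```
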
biofx.parser.SnpEffparser.has_chromosome_error(eff, chromosome, valid_pattern)[source]

Check for ERROR_CHROMOSOME_NOT_FOUND

Parameters:
  • eff (string) – snpeff EFF string
  • chromosome (string) – chromosome to be checked
  • valid_pattern (string) – regex for valid chromosome patterns; see biofx.parser.GSCDefinitions for more info.

Returns:True if error
Return type:bool
biofx.parser.SnpEffparser.merge_eff_maps(hgvs, classic, ordered=True, exclude_hgvs_effect_type=['chromosome_number_variation'], check_genotype=True)[source]

Merge effect maps

Parameters:
  • hgvs (list) – list of dictionaries (parse_effect() output with hgvs)
  • classic (list) – list of dictionaries (parse_effect() output)
  • ordered (bool) – are both hgvs and classic lists in order
  • exclude_hgvs_effect_type (list) – hgvs effect types to exclude
  • check_genotype (bool) – check genotype if True - used when there are multiple alleles
Returns:

merged eff maps

Return type:

merged_eff (list)

Raises:
  • ValueError – if annotation is not in exclude_hgvs_effect_type but is of high/moderate importance when there is no transcript assigned.
  • ValueError – no classic annotation on transcript for variant, can’t merge
biofx.parser.SnpEffparser.parse_effect(effect, hgvs=False)[source]

Parses snpeff effect with the following format:

>>> EFF= Effect ( Effect_Impact | Functional_Class | Codon_Change | Amino_Acid_Change| Amino_Acid_Length | Gene_Name | Transcript_BioType | Gene_Coding | Transcript_ID | Exon_Rank  | Genotype_Number [ | ERRORS | WARNINGS ] )
Parameters:
  • effect (tuple) – output of parse_snpeff(). A tuple with effect type as the first element followed by effect descriptions.
  • hgvs (bool) – True if hgvs effects
Returns:

a dictionary mapping effect key and value.

Return type:

eff (dict)

Example:

>>> effect = ("INTRON","(MODIFIER|||||DDX12P|unprocessed_pseudogene|NON_CODING|ENST00000290818|19|1)")
>>> eff_map = parse_effect(effect)
>>> eff_map
{'classic_protein_sequence_change': '', 'codon_change': '', 'transcript': 'ENST00000290818', 'functional_class': '', 'coding': 'NON_CODING', 'gene_symbol': 'DDX12P', 'gene_biotype': 'unprocessed_pseudogene', 'classic_effect_type': 'INTRON', 'exon': '19', 'amino_acid_length': '', 'effect_impact': 'MODIFIER'}
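
To make the field-to-key mapping in the example above concrete, here is a minimal sketch that reproduces the same dictionary for classic (non-hgvs) effects. The key order is inferred from the EFF format and the example output; parse_effect_sketch is a hypothetical stand-in and does not handle hgvs effects, errors or warnings.

```python
# Keys in EFF field order, inferred from the documented example output.
EFF_KEYS = [
    "effect_impact",
    "functional_class",
    "codon_change",
    "classic_protein_sequence_change",
    "amino_acid_length",
    "gene_symbol",
    "gene_biotype",
    "coding",
    "transcript",
    "exon",
]


def parse_effect_sketch(effect):
    """Map the pipe-separated EFF fields of one effect onto named keys."""
    effect_type, description = effect
    fields = description.strip("()").split("|")
    # zip() stops at the last key, so trailing fields (genotype number,
    # errors, warnings) are dropped, as in the documented example output.
    eff = dict(zip(EFF_KEYS, fields))
    eff["classic_effect_type"] = effect_type
    return eff


effect = ("INTRON", "(MODIFIER|||||DDX12P|unprocessed_pseudogene|NON_CODING|ENST00000290818|19|1)")
eff_map = parse_effect_sketch(effect)
```
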
biofx.parser.SnpEffparser.parse_snpeff(raw_text)[source]

Parses snpeff string

Parameters:
  • raw_text (string) – a string in the format of Effect ( Effect_Impact | Functional_Class | Codon_Change | Amino_Acid_change| Amino_Acid_length | Gene_Name | Gene_BioType | Coding | Transcript | Exon | GenotypeNum [ | ERRORS | WARNINGS ] ) generated by snpeff 3.3/4.1. Multiple effects are comma-separated.
Returns:

a list of tuples of effects

Return type:

m (list)

Raises:

AssertionError – This parser takes EFF formatted annotations only

Notes

In SnpEff 4.* LOF and NMD predictions are added by default. They are separated by semicolons. See “Loss of function (LOF) and nonsense-mediated decay (NMD) predictions” for more info. The LOF and NMD tags are ignored for now.

Also multiple, ‘+’ separated effects are allowed.
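
A rough sketch of the splitting step, under several assumptions: the optional "EFF=" prefix is stripped, anything after the first semicolon (such as LOF/NMD tags) is discarded, and no field contains parentheses. split_effects is a hypothetical helper, not the parser itself.

```python
import re

# An effect type (possibly '+'-joined) followed by its parenthesised fields.
EFFECT_RE = re.compile(r"([^,(]+)(\([^)]*\))")


def split_effects(raw_text):
    """Split a raw EFF string into (effect_type, fields) tuples."""
    # Keep only the EFF entry; LOF/NMD tags follow after a semicolon.
    eff_field = raw_text.split(";")[0]
    if eff_field.startswith("EFF="):
        eff_field = eff_field[len("EFF="):]
    effects = EFFECT_RE.findall(eff_field)
    assert effects, "expected an EFF-formatted annotation"
    return effects


raw = (
    "EFF=splice_region_variant+non_coding_exon_variant"
    "(LOW|||n.736C>T||GARNL3|retained_intron|CODING|ENST00000495172|8|T),"
    "intron_variant"
    "(MODIFIER|||n.425-3424C>T||GARNL3|processed_transcript|CODING|ENST00000464616|5|T)"
)
effects = split_effects(raw)
```

Each resulting tuple has the shape expected by parse_effect(): the effect type first, then the parenthesised description.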

biofx.parser.SnpEffparser.select_eff_by_gene_symbol(gene_symbol, eff_maps, multiple=False)[source]
Parameters:
  • gene_symbol (string) – gene symbol (or any other id used for snpeff annotations)
  • eff_maps (list) – list of eff maps (see output of parse_effect())
  • multiple (bool) – True if return multiple eff. Otherwise select first one seen when multiple selections found. False by default.
Returns:

tuple containing:

  • selected_eff (dict/list): if multiple True, returns a list of selected eff with matching transcript
  • alternative_eff (dict)

Return type:

(tuple)

Raises:
  • ValueError – Gene symbol provided should be annotated in eff_maps. nothing to select otherwise.
  • RuntimeError – genes before and after selection should be the same
biofx.parser.SnpEffparser.select_eff_by_transcript(transcript_id, eff_maps, multiple=False, partial=False, alt=None, random=False, sort_order=True)[source]
Parameters:
  • transcript_id (string) – ensembl transcript id (or any other id used for snpeff annotations)
  • eff_maps (list) – list of eff maps (see output of parse_effect())
  • multiple (bool) – True if return multiple eff. Otherwise select first one seen when multiple selections found. False by default.
  • partial (bool) – True if match transcript ID partially. Not recommended.
  • alt (string) – alt allele
  • random (bool) – select a random eff if True when multiple is False. Typically used when multiple alt alleles are seen. Supersedes the alt rule.
Returns:

tuple containing:

  • selected_eff (dict/list): if multiple True, returns a list of selected eff with matching transcript
  • alternative_eff (dict)

Return type:

(tuple)

Raises:

AssertionError – transcripts before and after selection should add up
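The selection can be pictured as a partition of the effect maps into matching and non-matching entries. The sketch below covers only exact matching and the multiple flag; it returns the non-matching entries as a list, which is an assumption, and it leaves out partial, alt, random and sort_order.

```python
def select_by_transcript(transcript_id, eff_maps, multiple=False):
    """Partition effect maps into (selected, alternative) by transcript ID."""
    selected = [e for e in eff_maps if e.get("transcript") == transcript_id]
    alternative = [e for e in eff_maps if e.get("transcript") != transcript_id]
    # Every effect map must land in exactly one of the two groups.
    assert len(selected) + len(alternative) == len(eff_maps)

    if not multiple:
        # Assumed behaviour: take the first match, or None if nothing matched.
        selected = selected[0] if selected else None
    return selected, alternative


eff_maps = [
    {"transcript": "ENST00000373387", "effect_impact": "MODERATE"},
    {"transcript": "ENST00000495172", "effect_impact": "LOW"},
]
selected, alternative = select_by_transcript("ENST00000495172", eff_maps)
```
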

biofx.parser.SnpEffparser.verify_effects(eff_a, eff_b)[source]

Compare snpeff effects, generally between Sequence Ontology effects and classic effects type as documented in the Snpeff manual under section ‘Effect prediction details’.

Parameters:
Returns:

True if both effects are equivalent.

Return type:

(bool)

Raises:

AssertionError – more than one equivalent effect

References

Snpeff http://snpeff.sourceforge.net/SnpEff_manual.html (Retrieved on June 10 2015)

Notes

Effect types in v4.* are sometimes concatenated. i.e. there can be multiple effects on a single transcript. This happens at splice sites. For example:

>>> EFF=missense_variant(MODERATE|MISSENSE|Cgc/Tgc|p.Arg323Cys/c.967C>T|1013|GARNL3|protein_coding|CODING|ENST00000373387|11|T),missense_variant(MODERATE|MISSENSE|Cgc/Tgc|p.Arg301Cys/c.901C>T|991|GARNL3|protein_coding|CODING|ENST00000435213|12|T),missense_variant(MODERATE|MISSENSE|Cgc/Tgc|p.Arg323Cys/c.967C>T|820|GARNL3|protein_coding|CODING|ENST00000314904|11|T),splice_region_variant+non_coding_exon_variant(LOW|||n.736C>T||GARNL3|retained_intron|CODING|ENST00000495172|8|T),downstream_gene_variant(MODIFIER||3135|c.*905C>T|266|GARNL3|protein_coding|CODING|ENST00000439286||T|WARNING_TRANSCRIPT_INCOMPLETE),intron_variant(MODIFIER|||n.425-3424C>T||GARNL3|processed_transcript|CODING|ENST00000464616|5|T),non_coding_exon_variant(MODIFIER|||n.1140C>T||GARNL3|retained_intron|CODING|ENST00000485331|9|T),non_coding_exon_variant(MODIFIER|||n.913C>T||GARNL3|nonsense_mediated_decay|CODING|ENST00000373386|11|T)

Currently the output will be the concatenated effects; code will generate warnings when impact of the different effect types differ (as suggested in the mapping table provided in the Effect prediction details section in the manual). In the example above, ENST00000495172 has a splice_region_variant + non_coding_exon_variant. Splice regions have “low” impact and exon variants are “modifiers”. But we are ignoring the differences and taking the higher snpeff impact (LOW, in this case) as the bona fide impact.
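
The rule described above, taking the higher impact when concatenated effect types disagree, can be sketched as follows. The lookup table holds only the two effect types named in the example; the full mapping is in the SnpEff manual, and resolve_impact is a hypothetical helper.

```python
import warnings

# Lower rank means higher impact.
IMPACT_RANK = {"HIGH": 0, "MODERATE": 1, "LOW": 2, "MODIFIER": 3}

# Only the two effect types discussed in the notes above.
EFFECT_IMPACT = {
    "splice_region_variant": "LOW",
    "non_coding_exon_variant": "MODIFIER",
}


def resolve_impact(effect_type):
    """Return the highest impact among '+'-concatenated effect types."""
    impacts = [EFFECT_IMPACT[name] for name in effect_type.split("+")]
    if len(set(impacts)) > 1:
        warnings.warn("impacts differ for %s: %s" % (effect_type, impacts))
    return min(impacts, key=IMPACT_RANK.__getitem__)


impact = resolve_impact("splice_region_variant+non_coding_exon_variant")
```
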

Module contents