biofx.parser package

Submodules

biofx.parser.GSCDirectoryTreeParser module

@cchng

This module parses POG-like directory structures.

Todo

Fix all the hard-coded paths.

class biofx.parser.GSCDirectoryTreeParser.GSCDirectoryTemplate(config='/home/cchoo/git/indel_merge/venv/lib/python3.6/site-packages/biofx-0.18.3-py3.6.egg/biofx/configs/directories.conf')

Bases: object

TEMPLATE_CONFIG = '/home/cchoo/git/indel_merge/venv/lib/python3.6/site-packages/biofx-0.18.3-py3.6.egg/biofx/configs/directories.conf'

get_path_template(analysis_type, directory_type)

Get the path template for a specific analysis type.

Parameters:	analysis_type (string) – analysis type
	directory_type (string) – directory type
Returns:	template path that can be formatted with `format()`
Return type:	p (string)
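
As a minimal sketch of how a returned template might be filled in, assume a hypothetical template string with `{root}`, `{sample_id}` and `{library}` placeholders. The real placeholder names are defined in directories.conf and are not shown on this page, and `fill_template` is an illustrative helper, not part of biofx.

```python
# Hypothetical template; the real ones are defined in directories.conf.
template = "{root}/{sample_id}/wgs/{library}/bam"


def fill_template(template, **fields):
    """Fill a path template using str.format()."""
    return template.format(**fields)


path = fill_template(
    template, root="/projects/POG", sample_id="POG123", library="A12345"
)
print(path)
```

Because the template is an ordinary format string, a missing field raises `KeyError`, which surfaces an incomplete call early.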

class biofx.parser.GSCDirectoryTreeParser.GSCDirectoryTree(_id=None, project='POG', root=None, config='/home/cchoo/git/indel_merge/venv/lib/python3.6/site-packages/biofx-0.18.3-py3.6.egg/biofx/configs/tumour_characterization_project.cfg')

Bases: object

GSCDirectoryTree class contains methods for retrieving items in the tree.

_id: string – Top-level identifier. Defaults to None.

project: string – Project name. Defaults to POG.

root: string – Base path to standard directory. Defaults to None.

config: string – Path to project configs. Defaults to configs/tumour_characterization_project.cfg.

DEFAULT_CONFIG = '/home/cchoo/git/indel_merge/venv/lib/python3.6/site-packages/biofx-0.18.3-py3.6.egg/biofx/configs/tumour_characterization_project.cfg'

PROJECT_CONFIG = '/home/cchoo/git/indel_merge/venv/lib/python3.6/site-packages/biofx-0.18.3-py3.6.egg/biofx/configs/tumour_characterization_project.cfg'

add_expression_analysis(stranded=True)

Add expression analysis.

Returns:	number of expression analyses.
Return type:	(int)

add_paired_analysis()

Add paired analysis.

Returns:	number of paired analyses.
Return type:	(int)

add_wgs_bams()

get_file_with_id(template, library='*', prefix='*')

Parameters:	template (string) – a string template
	library (string) – library
	prefix (string) – file prefix
Returns:	a list of file paths matching template and other args provided
Return type:	files (list)
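
The `'*'` defaults suggest glob-style matching. The sketch below shows that idea under stated assumptions: the template is a format string with hypothetical `{library}` and `{prefix}` fields, and `find_files` is an illustrative stand-in rather than the biofx implementation.

```python
import glob
import os
import tempfile


def find_files(template, library="*", prefix="*"):
    """Expand a path template and return the matching files, sorted.

    The '*' defaults act as glob wildcards, so omitting an argument
    matches every library or every file prefix.
    """
    pattern = template.format(library=library, prefix=prefix)
    return sorted(glob.glob(pattern))


# Demonstration on a throwaway directory tree (hypothetical layout).
with tempfile.TemporaryDirectory() as root:
    for lib in ("A001", "A002"):
        os.makedirs(os.path.join(root, lib))
        open(os.path.join(root, lib, "sample.vcf"), "w").close()

    template = os.path.join(root, "{library}", "{prefix}.vcf")
    all_files = find_files(template)
    one_library = find_files(template, library="A001")

print(len(all_files), len(one_library))
```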

get_ids(): Retrieves identifiers as defined by subdirectory name.

get_rna_libraries()

Get RNA libraries.

Returns:
Return type:	(list)

get_sample_info(library)

Get sample info (for example, biop1) for a library.

Parameters:	library (string) – library ID

get_tumour_content(library)

Get tumour content for a library.

Parameters:	library (string) – library ID
Returns:	tumour content. None if a paired analysis is not available, and ‘not available’ if it hasn’t been reviewed.
Return type:	(string)

get_wgs_libraries(biotype=None)

Get WGS libraries.

Parameters:	biotype (string) – not implemented.
Returns:
Return type:	(list)

parse_libraries(wgs=True, rna=True): Parses libraries in the tree and returns a dictionary.

set_id(_id)

Set identifier.

Parameters:	_id (string) – identifier
Raises:	`ValueError` – invalid identifier

biofx.parser.LRGparser module

@cchng

This module processes LRG transcript resource files, for example ftp://ftp.ebi.ac.uk/pub/databases/lrgex/list_LRGs_transcripts_GRCh37.txt. Descriptions of the file are available at http://www.lrg-sequence.org/downloads.

biofx.parser.LRGparser.parse_LRG(infile)

Parses LRG file.

Parameters:	infile – input file path

biofx.parser.SnpEffparser module

@cchng

This module processes SnpEff files and EFF strings.

biofx.parser.SnpEffparser.eff_has_transcript(transcript_id, eff_maps, partial=False)

Parameters:	transcript_id (string) – Ensembl transcript ID
	eff_maps (list) – output of `parse_effect()`
	partial (bool) – check based on partial transcript match (in the case of RefSeq)
Returns:	True if transcript in list of effects
Return type:	(bool)
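
The check can be pictured with the sketch below. It assumes that each effect map stores its transcript under the `transcript` key (as in the `parse_effect()` example output) and that a partial match means a substring match, such as an unversioned RefSeq ID against a versioned one. Both the helper name and the substring rule are assumptions, not the library's confirmed behaviour.

```python
def has_transcript(transcript_id, eff_maps, partial=False):
    """Return True if any effect map is annotated on transcript_id."""
    for eff in eff_maps:
        annotated = eff.get("transcript", "")
        if partial:
            # Assumption: "partial" means transcript_id is a substring of
            # the annotated ID (e.g. NM_000546 within NM_000546.5).
            if transcript_id and transcript_id in annotated:
                return True
        elif transcript_id == annotated:
            return True
    return False


eff_maps = [{"transcript": "NM_000546.5"}, {"transcript": "ENST00000290818"}]
exact = has_transcript("ENST00000290818", eff_maps)
unversioned_exact = has_transcript("NM_000546", eff_maps)
unversioned_partial = has_transcript("NM_000546", eff_maps, partial=True)
print(exact, unversioned_exact, unversioned_partial)
```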

biofx.parser.SnpEffparser.filter_snpeff_impact(eff_maps, impact_filter=['HIGH', 'MODERATE'], hierarchical=False)

Filter SnpEff annotations by impact.

Parameters:	eff_maps (list) – output of `parse_effect()`
	impact_filter (list) – list of impacts to be included in the output. Defaults to [“HIGH”, “MODERATE”].
	hierarchical (bool) – return highest impact only
Returns:	eff_maps format, filtered by impact
Return type:	filtered_by_impact (list)
Raises:	`AssertionError` – Only HIGH, MODERATE, LOW, MODIFIER impacts are accepted, as defined in the SnpEff manual (Section – EFF field).
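
A rough sketch of this kind of filtering follows. It assumes the impact is stored under the `effect_impact` key (as in the `parse_effect()` example output) and that "highest impact only" keeps every entry sharing the most severe impact present. `filter_by_impact` is an illustrative stand-in, not the biofx code.

```python
VALID_IMPACTS = ("HIGH", "MODERATE", "LOW", "MODIFIER")  # most to least severe


def filter_by_impact(eff_maps, impact_filter=("HIGH", "MODERATE"),
                     hierarchical=False):
    """Keep effect maps whose impact is listed in impact_filter."""
    for impact in impact_filter:
        assert impact in VALID_IMPACTS, "unknown impact: %s" % impact

    kept = [e for e in eff_maps if e["effect_impact"] in impact_filter]
    if hierarchical and kept:
        # Assumption: keep only the entries with the most severe impact
        # among those that passed the filter.
        top = min(e["effect_impact"] for e in kept
                  if True) if False else min(
            (e["effect_impact"] for e in kept), key=VALID_IMPACTS.index)
        kept = [e for e in kept if e["effect_impact"] == top]
    return kept


effs = [
    {"effect_impact": "HIGH"},
    {"effect_impact": "MODERATE"},
    {"effect_impact": "LOW"},
    {"effect_impact": "MODERATE"},
]
default_filtered = filter_by_impact(effs)
top_only = filter_by_impact(effs, hierarchical=True)
print(len(default_filtered), len(top_only))
```

The sketch uses a tuple for the default `impact_filter` so that the default cannot be mutated between calls.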

biofx.parser.SnpEffparser.has_chromosome_error(eff, chromosome, valid_pattern)

Check for ERROR_CHROMOSOME_NOT_FOUND.

Parameters:	eff (string) – SnpEff EFF string
	chromosome (string) – chromosome to be checked
	valid_pattern (string) – regex for valid chromosome patterns; see biofx.parser.GSCDefinitions for more info
Returns:	True if the error is present
Return type:	bool

biofx.parser.SnpEffparser.merge_eff_maps(hgvs, classic, ordered=True, exclude_hgvs_effect_type=['chromosome_number_variation'], check_genotype=True)

Merge effect maps.

Parameters:	hgvs (list) – list of dictionaries (`parse_effect()` output with hgvs)
	classic (list) – list of dictionaries (`parse_effect()` output)
	ordered (bool) – whether both the hgvs and classic lists are in order
	exclude_hgvs_effect_type (list) – hgvs effect types to exclude
	check_genotype (bool) – check genotype if True; used when there are multiple alleles
Returns:	merged eff maps
Return type:	merged_eff (list)
Raises:	`ValueError` – if an annotation is not in exclude_hgvs_effect_type but is of high/moderate importance when there is no transcript assigned.
	`ValueError` – there is no classic annotation on the transcript for the variant, so the maps can’t be merged

biofx.parser.SnpEffparser.parse_effect(effect, hgvs=False)

Parses a SnpEff effect with the following format:

>>> EFF= Effect ( Effect_Impact | Functional_Class | Codon_Change | Amino_Acid_Change| Amino_Acid_Length | Gene_Name | Transcript_BioType | Gene_Coding | Transcript_ID | Exon_Rank  | Genotype_Number [ | ERRORS | WARNINGS ] )

Parameters:	effect (tuple) – a tuple from the output of `parse_snpeff()`, with the effect type as the first element followed by the effect descriptions
	hgvs (bool) – True if hgvs effects
Returns:	a dictionary mapping effect key and value.
Return type:	eff (dict)

Example:

>>> effect = ("INTRON","(MODIFIER|||||DDX12P|unprocessed_pseudogene|NON_CODING|ENST00000290818|19|1)")
>>> eff_map = parse_effect(effect)
>>> eff_map
{'classic_protein_sequence_change': '', 'codon_change': '', 'transcript': 'ENST00000290818', 'functional_class': '', 'coding': 'NON_CODING', 'gene_symbol': 'DDX12P', 'gene_biotype': 'unprocessed_pseudogene', 'classic_effect_type': 'INTRON', 'exon': '19', 'amino_acid_length': '', 'effect_impact': 'MODIFIER'}
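
The mapping in the example can be reproduced with a few lines of string handling. The sketch below is a simplified stand-in for the classic (non-hgvs) case only: the key names are taken from the example output above, the field order from the EFF format, and `parse_effect_sketch` is a hypothetical name. It does not handle `hgvs=True`, errors or warnings.

```python
# Field order follows the EFF format; key names follow the example output.
FIELDS = (
    "effect_impact",
    "functional_class",
    "codon_change",
    "classic_protein_sequence_change",
    "amino_acid_length",
    "gene_symbol",
    "gene_biotype",
    "coding",
    "transcript",
    "exon",
)


def parse_effect_sketch(effect):
    """Map a (effect_type, "(a|b|...)") tuple to a dictionary."""
    effect_type, description = effect
    values = description.strip("()").split("|")
    # zip() stops at the last named field, so the trailing genotype
    # number (and any errors/warnings) are dropped, as in the example.
    eff = dict(zip(FIELDS, values))
    eff["classic_effect_type"] = effect_type
    return eff


effect = ("INTRON", "(MODIFIER|||||DDX12P|unprocessed_pseudogene|NON_CODING|ENST00000290818|19|1)")
eff_map = parse_effect_sketch(effect)
print(eff_map["transcript"], eff_map["effect_impact"])
```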

biofx.parser.SnpEffparser.parse_snpeff(raw_text)

Parses a SnpEff string.

Parameters:	raw_text (string) – a string in the format `Effect ( Effect_Impact | Functional_Class | Codon_Change | Amino_Acid_change | Amino_Acid_length | Gene_Name | Gene_BioType | Coding | Transcript | Exon | GenotypeNum [ | ERRORS | WARNINGS ] )`, as generated by SnpEff 3.3/4.1. Multiple effects are comma-separated.
Returns:	a list of tuples of effects
Return type:	m (list)
Raises:	`AssertionError` – This parser takes EFF formatted annotations only

Notes

In SnpEff 4.*, LOF and NMD predictions are added by default. They are separated by semicolons. See “Loss of function (LOF) and nonsense-mediated decay (NMD) predictions” for more info. The LOF and NMD tags are ignored for now.

Multiple ‘+’-separated effects are also allowed.
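
The splitting described in these notes could look roughly like the sketch below. It assumes the raw text starts with `EFF=`, that everything after the first semicolon (such as LOF and NMD tags) can be discarded, and that no field contains a closing parenthesis. `split_effects` is a hypothetical name, and the LOF values in the sample string are made up.

```python
import re

# An effect name (possibly '+'-joined) followed by a parenthesised body.
EFFECT_RE = re.compile(r"([A-Za-z0-9_+]+)(\([^)]*\))")


def split_effects(raw_text):
    """Split an EFF string into (effect_type, "(fields)") tuples."""
    assert raw_text.startswith("EFF="), "EFF formatted annotations only"
    # Keep only the EFF field; LOF=/NMD= tags follow a semicolon.
    eff_field = raw_text[len("EFF="):].split(";", 1)[0]
    return EFFECT_RE.findall(eff_field)


raw = (
    "EFF=splice_region_variant+non_coding_exon_variant"
    "(LOW|||n.736C>T||GARNL3|retained_intron|CODING|ENST00000495172|8|T),"
    "intron_variant"
    "(MODIFIER|||n.425-3424C>T||GARNL3|processed_transcript|CODING|ENST00000464616|5|T)"
    ";LOF=(GARNL3|GENE_ID|8|0.12)"  # made-up LOF values
)
effects = split_effects(raw)
for effect_type, description in effects:
    print(effect_type)
```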

biofx.parser.SnpEffparser.select_eff_by_gene_symbol(gene_symbol, eff_maps, multiple=False)

Parameters:	gene_symbol (string) – gene symbol (or any other ID used for SnpEff annotations)
	eff_maps (list) – list of eff maps (see output of `parse_effect()`)
	multiple (bool) – True to return multiple eff. Otherwise, the first one seen is selected when multiple selections are found. False by default.
Returns:	tuple containing:
	selected_eff (dict/list) – if multiple is True, a list of selected eff with matching transcript
	alternative_eff (dict)
Return type:	(tuple)
Raises:	`ValueError` – the gene symbol provided should be annotated in eff_maps; there is nothing to select otherwise.
	`RuntimeError` – genes before and after selection should be the same

biofx.parser.SnpEffparser.select_eff_by_transcript(transcript_id, eff_maps, multiple=False, partial=False, alt=None, random=False, sort_order=True)

Parameters:	transcript_id (string) – Ensembl transcript ID (or any other ID used for SnpEff annotations)
	eff_maps (list) – list of eff maps (see output of `parse_effect()`)
	multiple (bool) – True to return multiple eff. Otherwise, the first one seen is selected when multiple selections are found. False by default.
	partial (bool) – True to match the transcript ID partially. Not recommended.
	alt (string) – alt allele
	random (bool) – select a random eff if True when multiple is False. Typically used when multiple alt alleles are seen. Supersedes the alt rule.
Returns:	tuple containing:
	selected_eff (dict/list) – if multiple is True, a list of selected eff with matching transcript
	alternative_eff (dict)
Return type:	(tuple)
Raises:	`AssertionError` – transcripts before and after selection should add up

biofx.parser.SnpEffparser.verify_effects(eff_a, eff_b)

Compare SnpEff effects, generally between Sequence Ontology effects and classic effect types, as documented in the SnpEff manual under the section ‘Effect prediction details’.

Parameters:	eff_a (dict) – output of `parse_effect()`
	eff_b (dict) – output of `parse_effect()`
Returns:	True if both effects are equivalent.
Return type:	(bool)
Raises:	`AssertionError` – more than one equivalent effect

References

SnpEff manual, http://snpeff.sourceforge.net/SnpEff_manual.html (retrieved June 10, 2015)

Notes

Effect types in v4.* are sometimes concatenated, i.e. there can be multiple effects on a single transcript. This happens at splice sites. For example:

>>> EFF=missense_variant(MODERATE|MISSENSE|Cgc/Tgc|p.Arg323Cys/c.967C>T|1013|GARNL3|protein_coding|CODING|ENST00000373387|11|T),missense_variant(MODERATE|MISSENSE|Cgc/Tgc|p.Arg301Cys/c.901C>T|991|GARNL3|protein_coding|CODING|ENST00000435213|12|T),missense_variant(MODERATE|MISSENSE|Cgc/Tgc|p.Arg323Cys/c.967C>T|820|GARNL3|protein_coding|CODING|ENST00000314904|11|T),splice_region_variant+non_coding_exon_variant(LOW|||n.736C>T||GARNL3|retained_intron|CODING|ENST00000495172|8|T),downstream_gene_variant(MODIFIER||3135|c.*905C>T|266|GARNL3|protein_coding|CODING|ENST00000439286||T|WARNING_TRANSCRIPT_INCOMPLETE),intron_variant(MODIFIER|||n.425-3424C>T||GARNL3|processed_transcript|CODING|ENST00000464616|5|T),non_coding_exon_variant(MODIFIER|||n.1140C>T||GARNL3|retained_intron|CODING|ENST00000485331|9|T),non_coding_exon_variant(MODIFIER|||n.913C>T||GARNL3|nonsense_mediated_decay|CODING|ENST00000373386|11|T)

Currently the output is the concatenated effects; the code generates warnings when the impacts of the different effect types differ (according to the mapping table in the Effect prediction details section of the manual). In the example above, ENST00000495172 has a splice_region_variant + non_coding_exon_variant. Splice regions have “low” impact and exon variants are “modifiers”. The differences are ignored and the higher SnpEff impact (LOW, in this case) is taken as the bona fide impact.
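
The rule of taking the higher impact can be sketched as follows. The mapping holds only the two effect types discussed in this note (the full table is in the SnpEff manual), and `highest_impact` is an illustrative helper, not a biofx function.

```python
IMPACT_ORDER = ("MODIFIER", "LOW", "MODERATE", "HIGH")  # least to most severe

# Only the two effect types from the note above; not the full table.
EFFECT_IMPACT = {
    "splice_region_variant": "LOW",
    "non_coding_exon_variant": "MODIFIER",
}


def highest_impact(effect_type):
    """Return the most severe impact among '+'-concatenated effect types."""
    impacts = [EFFECT_IMPACT[name] for name in effect_type.split("+")]
    return max(impacts, key=IMPACT_ORDER.index)


combined = highest_impact("splice_region_variant+non_coding_exon_variant")
print(combined)
```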
