file_io module
Module which holds all functions for loading reference files.
mavis.annotate.file_io.convert_tab_to_json(filepath, warn=<function devnull>)
Given a file in the standard input format (see below), reads it and returns the genes (and their sub-objects) it describes.
column name            example                    description
ensembl_transcript_id  ENST000001
ensembl_gene_id        ENSG000001
strand                 -1                         positive or negative 1
cdna_coding_start      44                         where translation begins relative to the start of the cdna
cdna_coding_end        150                        where translation terminates
genomic_exon_ranges    100-201;334-412;779-830    semi-colon delimited exon start/ends
AA_domain_ranges       DBD:220-251,260-271        semi-colon delimited list of domains
hugo_names             KRAS                       hugo gene name
Parameters: filepath (str) – path to the input tab-delimited file
Returns: a dictionary keyed by chromosome name with values of lists of genes on the chromosome
Return type: dict of list of Gene by str
Example
>>> ref = load_reference_genes('filename')
>>> ref['1']
[Gene(), Gene(), ....]
Warning
Translations are not loaded unless they start with ‘M’, end with ‘*’, and have a length that is a multiple of 3.
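The two range columns in the table above pack several values into one field. The following is a minimal sketch of how such fields could be split, assuming exons are separated by semi-colons, domains are separated by semi-colons, and the regions of a single domain are separated by commas after the `name:` prefix. The helper names `parse_exon_ranges` and `parse_domain_ranges` are hypothetical and are not part of `mavis.annotate.file_io`.

```python
def parse_exon_ranges(field):
    """Split a genomic_exon_ranges field into a list of (start, end) tuples."""
    ranges = []
    for part in field.split(';'):
        part = part.strip()
        if not part:
            continue  # tolerate a trailing semi-colon
        start, end = part.split('-')
        ranges.append((int(start), int(end)))
    return ranges


def parse_domain_ranges(field):
    """Split an AA_domain_ranges field into {domain name: [(start, end), ...]}."""
    domains = {}
    for entry in field.split(';'):
        entry = entry.strip()
        if not entry:
            continue
        # assumption: the domain name never contains a colon
        name, regions = entry.split(':', 1)
        for region in regions.split(','):
            start, end = region.split('-')
            domains.setdefault(name, []).append((int(start), int(end)))
    return domains


exons = parse_exon_ranges('100-201;334-412;779-830')
domains = parse_domain_ranges('DBD:220-251,260-271')
```

With the example values from the table, `exons` holds three ranges and `domains` holds a single `DBD` entry with two regions.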
mavis.annotate.file_io.load_annotations(filepath, warn=<function devnull>, REFERENCE_GENOME=None, filetype=None, best_transcripts_only=False)
Loads gene models from an input file. Expects a tab-delimited or JSON file.
Parameters:
- filepath (str) – path to the input file
- verbose (bool) – output extra information to stdout
- REFERENCE_GENOME (dict of Bio.SeqRecord by str) – dict of reference sequence by template/chr name
- filetype (str) – json or tab/tsv. Only required if the file type can’t be inferred from the path extension
Returns: lists of genes keyed by chromosome name
Return type: dict of list of Gene by str
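The `filetype` parameter is only needed when the extension does not identify the format. The sketch below shows one way that fallback could work. The helper `infer_filetype` and the exact set of recognised extensions (`.json`, `.tab`, `.tsv`) are assumptions made for illustration and may not match what `load_annotations` actually does.

```python
import os


def infer_filetype(filepath, filetype=None):
    """Return 'json' or 'tab', preferring an explicit filetype over the extension."""
    if filetype is not None:
        filetype = filetype.lower()
        if filetype == 'json':
            return 'json'
        if filetype in ('tab', 'tsv'):
            return 'tab'
        raise ValueError('unsupported filetype: {}'.format(filetype))
    # assumption: only these extensions are treated as unambiguous
    extension = os.path.splitext(filepath)[1].lower()
    if extension == '.json':
        return 'json'
    if extension in ('.tab', '.tsv'):
        return 'tab'
    raise ValueError('cannot infer the file type of {}; pass filetype'.format(filepath))
```

A path such as `annotations.txt` raises an error in this sketch unless `filetype` is given explicitly.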
mavis.annotate.file_io.load_masking_regions(filepath)
Reads a file of regions. The expected input format for the file is tab-delimited, and the header should contain the following columns:
- chr: the chromosome
- start: start of the region, 1-based inclusive
- end: end of the region, 1-based inclusive
- name: the name/label of the region
For example:
#chr    start       end         name
chr20   25600000    27500000    centromere
Parameters: filepath (str) – path to the input tab-delimited file
Returns: a dictionary keyed by chromosome name with values of lists of regions on the chromosome
Return type: dict of list of BioInterval by str
Example
>>> m = load_masking_regions('filename')
>>> m['1']
[BioInterval(), BioInterval(), ...]
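To make the expected file layout concrete, here is a stand-alone sketch that groups regions by chromosome using only the standard library. It is not the MAVIS implementation. The function `read_masking_regions` is hypothetical, and plain tuples stand in for `BioInterval` objects.

```python
import csv
import io


def read_masking_regions(handle):
    """Group (start, end, name) tuples by chromosome from a tab-delimited handle."""
    regions = {}
    header = None
    for row in csv.reader(handle, delimiter='\t'):
        if not row:
            continue  # skip blank lines
        if header is None:
            # the first non-empty row is the header; drop the leading '#'
            header = [column.lstrip('#') for column in row]
            continue
        record = dict(zip(header, row))
        region = (int(record['start']), int(record['end']), record['name'])
        regions.setdefault(record['chr'], []).append(region)
    return regions


example = io.StringIO('#chr\tstart\tend\tname\nchr20\t25600000\t27500000\tcentromere\n')
masks = read_masking_regions(example)
```

Because the columns are looked up by header name, their order in the file does not matter in this sketch.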
mavis.annotate.file_io.load_reference_genome(filename, low_mem=False)
Parameters: filename (str) – the path to the file containing the input fasta genome
Returns: a dictionary representing the sequences in the fasta file
Return type: dict of Bio.SeqRecord by str
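The shape of the returned mapping (one entry per sequence, keyed by sequence name) can be illustrated with a small stand-in parser. This sketch uses only the standard library and stores plain strings, whereas the documented function returns `Bio.SeqRecord` objects. The name `read_fasta` is hypothetical, and the effect of `low_mem` is not described in the source, so it is not modelled here.

```python
import io


def read_fasta(handle):
    """Return {sequence name: sequence string} from a fasta-formatted handle."""
    sequences = {}
    name = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith('>'):
            if name is not None:
                sequences[name] = ''.join(chunks)
            # assumption: the name is the first whitespace-delimited token
            fields = line[1:].split()
            name = fields[0] if fields else ''
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        sequences[name] = ''.join(chunks)
    return sequences


genome = read_fasta(io.StringIO('>chr1 example\nACGT\nTT\n>chr2\nGG\n'))
```

Sequence lines that are wrapped across several lines are joined into a single string per name.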
mavis.annotate.file_io.load_templates(filename)
Primarily useful if template drawings are required; not necessary otherwise. Assumes the input file is 0-indexed with [start,end) style coordinates. Columns are expected in the following order, tab-delimited. A header should not be given.
- name
- start
- end
- band_name
- giesma_stain
for example
chr1    0          2300000    p36.33    gneg
chr1    2300000    5400000    p36.32    gpos25
Parameters: filename (str) – the path to the file with the cytoband template information
Returns: list of the templates loaded
Return type: list of Template
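Since the input is 0-indexed and half-open while the masking file described earlier is 1-based inclusive, a worked conversion may help. The sketch below reads the headerless five-column layout and shifts each start by one. Whether `load_templates` performs this shift internally is an assumption, and `Band` and `read_cytobands` are hypothetical names that do not come from MAVIS.

```python
import io
from collections import namedtuple

# hypothetical record type; MAVIS uses its own Template class
Band = namedtuple('Band', ['name', 'start', 'end', 'stain'])


def read_cytobands(handle):
    """Group bands by template name, converting to 1-based inclusive coordinates."""
    templates = {}
    for line in handle:
        line = line.rstrip('\r\n')
        if not line.strip():
            continue
        template_name, start, end, band_name, stain = line.split('\t')
        # 0-based [start, end) covers the same bases as 1-based [start + 1, end]
        band = Band(band_name, int(start) + 1, int(end), stain)
        templates.setdefault(template_name, []).append(band)
    return templates


rows = 'chr1\t0\t2300000\tp36.33\tgneg\nchr1\t2300000\t5400000\tp36.32\tgpos25\n'
bands = read_cytobands(io.StringIO(rows))
```

After conversion, adjacent bands no longer share a coordinate: the first ends at 2300000 and the second starts at 2300001.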