blat module¶
“In general the coordinates in psl files are “zero based half open.” The first base in a sequence is numbered zero rather than one. When representing a range the end coordinate is not included in the range. Thus the first 100 bases of a sequence are represented as 0-100, and the second 100 bases are represented as 100-200. There is another little unusual feature in the .psl format. It has to do with how coordinates are handled on the negative strand. In the qStart/qEnd fields the coordinates are where it matches from the point of view of the forward strand (even when the match is on the reverse strand). However on the qStarts[] list, the coordinates are reversed.” — http://wiki.bits.vib.be/index.php/Blat
-
class
mavis.blat.
Blat
[source]¶ Bases:
object
-
static
millibad
(row, is_protein=False, is_mrna=True)[source]¶ this function is used in calculating percent identity direct translation of the perl code # https://genome.ucsc.edu/FAQ/FAQblat.html#blat4
-
static
pslx_row_to_pysam
(row, bam_cache, reference_genome)[source]¶ given a ‘row’ from reading a pslx file. converts the row to a BlatAlignedSegment object
Parameters:
-
static
score
(row, is_protein=False)[source]¶ direct translation from ucsc guidelines on replicating the web blat score https://genome.ucsc.edu/FAQ/FAQblat.html#blat4
below are lines from the perl code i’ve re-written in python
my $sizeMul = pslIsProtein($blockCount, $strand, $tStart, $tEnd, $tSize, $tStarts, $blockSizes); sizmul = 1 for DNA my $pslScore = $sizeMul * ($matches + ($repMatches >> 1) ) - $sizeMul * $misMatches - $qNumInsert - $tNumIns ert)
-
static
-
class
mavis.blat.
BlatAlignedSegment
(reference_name=None, blat_score=None)[source]¶ Bases:
pysam.calignedsegment.AlignedSegment
Parameters: row ( dict
bystr
) – a row dictionary from the Blat.read_pslx method-
query_coverage_interval
()[source]¶ Returns: The portion of the original query sequence that is aligned by this read Return type: Interval
-
reference_name
¶
-
-
mavis.blat.
blat_contigs
(evidence, INPUT_BAM_CACHE, reference_genome, blat_pslx_output_file='blat_out.pslx', blat_fa_input_file='blat_in.fa', blat_2bit_reference='/home/pubseq/genomes/Homo_sapiens/GRCh37/blat/hg19.2bit', blat_min_percent_of_max_score=0.8, blat_min_identity=0.7, blat_min_query_consumption=0.5, is_protein=False, min_extend_overlap=10, pair_scoring_function=<function paired_alignment_score>, clean_files=True, log=<function <lambda>>, **kwargs)[source]¶ given a set of contigs, call blat from the command line and adds the results to the contigs associated with each Evidence object
Parameters: - evidence (list of Evidence) – the iterable container of of Evidence object which has associated contigs
- INPUT_BAM_CACHE (BamCache) – the bam to use as a template in generating bam-like reads
- reference_genome (
dict
ofBio.SeqRecord
bystr
) – dict of reference sequence by template/chr name - blat_2bit_reference (str) – path to the 2bit file for blat
- blat_min_percent_of_max_score (float) – ignores all alignments with a score less
- blat_min_identity (float) – minimum percent identity
- is_protein (bool) – is the sequence an amino acid sequence (used in the blat calculations)
- min_extend_overlap (int) – minimum amount of non-shared coverage of the template sequence required to pair alignments
- =blat_options (list of str) – optional, can specify alternate blat parameters to use
Todo
add support for blatting protein sequences allow one alignment to be lower score if its partner is higher score? or recompute a combined score?