Statistics and figures for sequencing library: HS1810 (Frontal_mRNA)
Summary of input lanes for this library
Basic info is provided for each lane of data comprising this library
Flowcell-Lane | Read Length | Quality Reads | Source Dir |
700B5AAXX_Lane2 | 2 x 76 | 47,348,587 | /projects/analysis/analysis5/HS1810/700B5AAXX_2/basecalls/ |
700B5AAXX_Lane1 | 2 x 76 | 46,185,469 | /projects/analysis/analysis5/HS1810/700B5AAXX_1/basecalls/ |
612EBAAXX_Lane3 | 2 x 76 | 30,947,902 | /projects/analysis/analysis5/HS1810/612EBAAXX_3/bustard/ |
Summary of sequence statistics for each lane
Basic statistics for each lane of sequence data are summarized below. 'Low quality' refers to reads with too many ambiguous bases (N's). 'Low complexity' refers to homopolymeric sequences (e.g. polyA tails) or other sequences of low complexity identified by 'mdust'. 'R1' and 'R2' refer to the first and second read of a paired-end read.
Flowcell-Lane | Paired Reads | Low Quality R1 | Low Quality R2 | Low Complexity R1 | Low Complexity R2 | Passing R1 | Passing R2 | Total Bases | Passing Bases (%) |
700B5AAXX_Lane2 | 24,771,189 | 3.10% | 3.00% | 0.55% | 2.20% | 96.35% | 94.80% | 3,765,220,728 | 92.92% |
700B5AAXX_Lane1 | 24,304,280 | 4.21% | 3.16% | 0.52% | 2.07% | 95.26% | 94.77% | 3,694,250,560 | 93.29% |
612EBAAXX_Lane3 | 21,965,862 | 15.15% | 42.34% | 0.55% | 1.07% | 84.30% | 56.59% | 3,338,811,024 | 84.28% |
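The report does not state how 'Total Bases' is derived, but the figures in the table are consistent with paired reads x 2 reads per pair x 76 bases per read. A minimal check of that assumed relationship, using only values from the tables above:

```python
# Assumption: Total Bases = paired reads * 2 reads per pair * read length.
# The read length (76) comes from the "2 x 76" column of the lane table.
READ_LENGTH = 76
READS_PER_PAIR = 2

paired_reads = {
    "700B5AAXX_Lane2": 24_771_189,
    "700B5AAXX_Lane1": 24_304_280,
    "612EBAAXX_Lane3": 21_965_862,
}

def total_bases(pairs, read_length=READ_LENGTH):
    """Number of sequenced bases for a lane of paired-end reads."""
    return pairs * READS_PER_PAIR * read_length

totals = {lane: total_bases(n) for lane, n in paired_reads.items()}
for lane, bases in totals.items():
    print(f"{lane}: {bases:,}")
```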
Summary of read assignments by assignment class
The assignment of reads to each read class is summarized below for the entire library. 'Unassigned' reads could not be assigned with high confidence to any known or predicted transcriptome or genome sequence. This does not mean that they had no significant similarity, only that they could not be assigned with high confidence. In this table, reads are counted as individual reads (not as read pairs).
Read Class | Read Count | Percent of Total |
Total read count | 142,082,668 | 100% |
Read1-Read2 identical | 2,048 | 0.00% |
Low complexity | 1,667,508 | 1.17% |
Low quality (>1 N) | 15,931,148 | 11.21% |
Ensembl transcript | 51,327,954 | 36.13% |
Ensembl transcript (ambiguous) | 4,587,471 | 3.23% |
Novel exon junction | 390,981 | 0.28% |
Novel exon junction (ambiguous) | 16,320 | 0.01% |
Novel exon boundary extension | 1,359,267 | 0.96% |
Novel exon boundary extension (ambiguous) | 224,318 | 0.16% |
Intron | 6,681,217 | 4.70% |
Intron (ambiguous) | 112,826 | 0.08% |
Intergenic | 2,560,805 | 1.80% |
Intergenic (ambiguous) | 41,023 | 0.03% |
Repeat element | 4,444,747 | 3.13% |
Repeat element (ambiguous) | 287,117 | 0.20% |
Unassigned | 52,447,912 | 36.91% |
Summary of mapping results for each lane
The mapping of reads to genome and transcriptome sequences is summarized below on a lane-by-lane basis. Only reads mapped with high confidence are summarized here. Reads counted under each sequence type were assigned unambiguously. Reads mapping ambiguously (i.e. reads that map equally well to multiple places) are summarized in the last column.
Flowcell-Lane | Total | Transcript | Novel junction | Novel boundary | Intron | Intergenic | Repeat element | Ambiguous |
612EBAAXX_Lane3 | 17,185,427 | 70.20% | 0.65% | 1.95% | 9.22% | 3.53% | 6.33% | 8.11% |
700B5AAXX_Lane1 | 27,277,790 | 71.60% | 0.49% | 1.85% | 9.27% | 3.56% | 6.18% | 7.06% |
700B5AAXX_Lane2 | 27,570,829 | 71.57% | 0.53% | 1.89% | 9.31% | 3.56% | 6.06% | 7.07% |
TOTAL | 72,034,046 | 71.26% | 0.54% | 1.89% | 9.28% | 3.55% | 6.17% | 7.31% |
Summary of average coverage values by feature type
The grand average coverage observed for each type of sequence feature is summarized below. Average coverage is calculated as the cumulative number of mapped bases divided by the total number of base positions in the genome (for that type of sequence feature).
Feature Type | Average Coverage | Total Base Count | Cumulative Coverage |
Gene | 37.84322 | 103,087,317 | 3,901,156,299 |
Transcript | 37.55661 | 64,444,159 | 2,420,304,027 |
ExonRegion | 37.84325 | 103,087,246 | 3,901,156,299 |
Junction | 1.23135 | 204,920,540 | 252,328,656 |
KnownJunction | 14.16680 | 17,657,352 | 250,148,190 |
NovelJunction | 0.01164 | 187,263,188 | 2,180,466 |
Boundary | 4.90142 | 37,074,326 | 181,716,759 |
KnownBoundary | 14.35760 | 6,833,888 | 98,118,248 |
NovelBoundary | 2.76446 | 30,240,438 | 83,598,511 |
Intron | 0.39815 | 1,275,364,373 | 507,780,853 |
ActiveIntronRegion | 1.76467 | 72,257,746 | 127,511,008 |
SilentIntronRegion | 0.31613 | 1,202,399,357 | 380,120,468 |
Intergenic | 0.11831 | 1,643,005,218 | 194,387,730 |
ActiveIntergenicRegion | 3.85654 | 26,791,623 | 103,322,866 |
SilentIntergenicRegion | 0.05633 | 1,616,101,525 | 91,028,982 |
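As a worked instance of the definition above, average coverage can be recomputed from the last two columns of the table. This is a small sketch using two rows from the table; the function name is illustrative only.

```python
# Feature type -> (total base count, cumulative coverage), copied from the table.
features = {
    "Gene": (103_087_317, 3_901_156_299),
    "Transcript": (64_444_159, 2_420_304_027),
}

def average_coverage(total_base_count, cumulative_coverage):
    """Cumulative mapped bases divided by the number of base positions."""
    return cumulative_coverage / total_base_count

for name, (bases, cumulative) in features.items():
    print(f"{name}: {average_coverage(bases, cumulative):.5f}")
```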
Summary of expressed events by feature type
The number of sequence features of each type detected above the level of background noise is summarized below. For each feature type, the total number of features is listed, followed by the subset of these features that are expressed versus not expressed above background.
Feature Type | Feature Count | Expressed (%) | Not Expressed (%) |
Gene | 49,868 | 13,432 (26.94%) | 36,436 (73.06%) |
Transcript | 134,413 | 13,172 (9.80%) | 121,241 (90.20%) |
ExonRegion | 442,847 | 169,146 (38.20%) | 273,701 (61.80%) |
Junction | 3,305,171 | 112,515 (3.40%) | 3,192,656 (96.60%) |
KnownJunction | 284,797 | 110,987 (38.97%) | 173,810 (61.03%) |
NovelJunction | 3,020,375 | 1,528 (0.05%) | 3,018,847 (99.95%) |
Boundary | 597,974 | 65,158 (10.90%) | 532,816 (89.10%) |
KnownBoundary | 110,225 | 35,633 (32.33%) | 74,592 (67.67%) |
NovelBoundary | 487,750 | 29,525 (6.05%) | 458,225 (93.95%) |
Intron | 235,634 | 4,609 (1.96%) | 231,025 (98.04%) |
ActiveIntronRegion | 237,209 | 11,188 (4.72%) | 226,021 (95.28%) |
SilentIntronRegion | 339,425 | 6,064 (1.79%) | 333,361 (98.21%) |
Intergenic | 30,261 | 790 (2.61%) | 29,471 (97.39%) |
ActiveIntergenic | 81,007 | 5,232 (6.46%) | 75,775 (93.54%) |
SilentIntergenic | 82,083 | 2,235 (2.72%) | 79,848 (97.28%) |
GRAND TOTAL (non-redundant) | 5,535,892 | 403,541 (7.29%) | 5,132,351 (92.71%) |
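The 'Expressed' and 'Not Expressed' columns follow directly from the feature count and the number of expressed features. A short sketch of that arithmetic, with an illustrative helper name:

```python
def expressed_summary(feature_count, expressed):
    """Format the expressed / not expressed cells for one feature type."""
    not_expressed = feature_count - expressed
    pct = 100 * expressed / feature_count
    return (
        f"{expressed:,} ({pct:.2f}%)",
        f"{not_expressed:,} ({100 - pct:.2f}%)",
    )

# Values taken from the Gene and KnownJunction rows of the table.
gene_cells = expressed_summary(49_868, 13_432)
junction_cells = expressed_summary(284_797, 110_987)
print(gene_cells)
print(junction_cells)
```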
Estimates of signal-to-noise ratio
(Average coverage of exon regions / Average coverage of silent intron regions) = 119.71
(Average coverage of exon regions / Average coverage of silent intergenic regions) = 671.81
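Both ratios can be reproduced from the average coverage values listed in the coverage table (ExonRegion, SilentIntronRegion and SilentIntergenicRegion rows):

```python
# Average coverage values copied from the coverage table.
exon_region = 37.84325
silent_intron_region = 0.31613
silent_intergenic_region = 0.05633

snr_intron = exon_region / silent_intron_region
snr_intergenic = exon_region / silent_intergenic_region

print(f"exon / silent intron     = {snr_intron:.2f}")
print(f"exon / silent intergenic = {snr_intergenic:.2f}")
```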
Estimates of intronic and intergenic noise levels (95th percentiles of silent intron and intergenic regions)
95th percentile of silent intron regions for library (HS1810) is: 9.04 (log2 = 3.18)
95th percentile of silent intergenic regions for library (HS1810) is: 7.9 (log2 = 2.98)
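The report does not say which percentile method is used. The sketch below assumes the simple nearest-rank definition, applied to hypothetical expression values, and then shows the log2 conversion for the two values reported above.

```python
import math

def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile (an assumed method, not confirmed by the report)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical expression values for silent regions, for illustration only.
example_values = list(range(1, 101))
p95 = percentile_nearest_rank(example_values, 95)

# log2 of the cutoffs reported for library HS1810.
log2_intron = math.log2(9.04)
log2_intergenic = math.log2(7.9)
print(p95, f"{log2_intron:.2f}", f"{log2_intergenic:.2f}")
```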
Summary of library complexity - estimated by tag redundancy per million reads and compared to other libraries
Library complexity is calculated for the sequence library by randomly sampling 1 million reads and determining the number of unique and redundant reads within the pool. This sampling is repeated (with replacement) at least 3 times and the average values across these samples are used. A 'redundant' read is one where both reads of a pair have either identical sequence (first figure) or identical mapping location (second figure) to at least one other read in the sampled pool. In each panel below, the value for the current library is indicated by a red dot and compared to other libraries, which are summarized as box and whisker plots. In the first panel (green), the percent of unique reads per million reads sampled is summarized. In the second panel (orange), the percent of redundant reads per million is summarized. In the third panel (yellow), the redundant reads are further examined to determine the number of distinct redundant reads. For example, if all redundant reads corresponded to a few sequences occurring many times, this would result in a low number of distinct redundant reads. A 'good' library with high complexity should have high, low, and high values for panels one, two and three respectively.
Library complexity assessed by read sequence identity
Library complexity assessed by read sequence mapping positions
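The sampling procedure described above can be sketched as follows. This is an interpretation, not the pipeline's own code: a read pair is represented as a hashable key (for example a tuple of both read sequences, or of both mapping positions), a pair seen once in the sample counts as unique, and every copy of a pair seen more than once counts as redundant. The function names, the `seed` parameter and the tiny example pool are all hypothetical.

```python
import random
from collections import Counter

def complexity_stats(read_pairs, sample_size, rng):
    """Return (% unique, % redundant, distinct redundant) for one sample."""
    sample = rng.sample(read_pairs, sample_size)
    counts = Counter(sample)
    unique = sum(1 for c in counts.values() if c == 1)
    distinct_redundant = sum(1 for c in counts.values() if c > 1)
    redundant = sample_size - unique
    return (
        100 * unique / sample_size,
        100 * redundant / sample_size,
        distinct_redundant,
    )

def mean_complexity(read_pairs, sample_size=1_000_000, repeats=3, seed=0):
    """Average the three statistics over several independent samples."""
    rng = random.Random(seed)
    runs = [complexity_stats(read_pairs, sample_size, rng) for _ in range(repeats)]
    return tuple(sum(column) / repeats for column in zip(*runs))

# Tiny hypothetical pool: keys are (R1 sequence, R2 sequence) tuples.
pool = [("A", "T")] * 4 + [("C", "G")] * 2 + [("G", "G")]
print(mean_complexity(pool, sample_size=7, repeats=3))
```

Using mapping positions instead of sequences as the key gives the second variant of the statistic without changing the code.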
Distribution of relative positions of reads mapping within known transcripts (i.e. position bias test)
The frequency of read positions is plotted against the relative position within each transcript, where 0% is the 5' end of the transcript and 100% is the 3' end. Read positions were binned according to the size of the transcripts they map within, and the plot is produced for each bin (i.e. reads mapping to transcripts of 0-500 bp, 500-1000 bp, ..., 15000-20000 bp, and > 20000 bp).
Distribution of fragment size for paired-end mappings to transcripts
The distribution of fragment sizes is plotted for all paired-end reads in which both reads of the pair could be aligned to the transcriptome. Pairs with reads aligning to different gene loci or different chromosomes are excluded. Fragment sizes accounting for less than 0.01% of total reads are not plotted.
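The 0.01% plotting threshold can be illustrated with a short sketch. The input is a hypothetical list of fragment sizes for pairs that have already passed the same-locus filter; integer arithmetic is used so that a size at exactly 0.01% is kept.

```python
from collections import Counter

def plottable_fragment_sizes(fragment_sizes):
    """Keep fragment sizes seen in at least 0.01% (1 in 10,000) of all pairs."""
    counts = Counter(fragment_sizes)
    total = len(fragment_sizes)
    return {size: n for size, n in counts.items() if n * 10_000 >= total}

# Hypothetical data: one rare fragment size among 20,000 pairs.
sizes = [200] * 19_999 + [900]
kept = plottable_fragment_sizes(sizes)
print(kept)
```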
Distribution of % of gene bases covered for each expressed gene (at various minimum depth levels)
The distribution of percent coverage levels for each gene is summarized as a box and whisker plot. These plots are produced for six minimum depth requirements. For example, in the first plot (>=1X), the percentage of bases covered to a depth of 1X or greater is determined for each gene. If a gene is completely covered by reads from the first base to the last, at a depth of 1X or greater, then this gene is given a value of 100%. The distribution of these percent values for all genes detected above background is summarized by the box and whisker plot.
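The per-gene value behind each box plot can be sketched as below, assuming a list of per-base read depths is available for each gene. The depth values and the thresholds other than 1X are hypothetical, since the report names only the first of the six depth levels.

```python
def percent_covered(depths, min_depth):
    """Percent of a gene's bases covered to at least min_depth."""
    if not depths:
        return 0.0
    covered = sum(1 for d in depths if d >= min_depth)
    return 100.0 * covered / len(depths)

# Hypothetical per-base depths for a four-base "gene".
gene_depths = [0, 1, 5, 10]
for threshold in (1, 5, 100):
    print(f">={threshold}X: {percent_covered(gene_depths, threshold):.1f}%")
```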
Distribution of log2 raw expression values for each feature type
Box-and-whisker plots for log2 expression values for each feature type.
Distribution of log2 normalized expression values for each feature type
Box-and-whisker plots for normalized log2 expression values for each feature type.
Density scatter plot of exon region versus gene expression values
Density scatter plot of log2 expression values for exon regions versus corresponding gene expression values.
Density scatter plot of silent intron region versus gene expression values
Density scatter plot of log2 expression values for silent intron regions versus corresponding gene expression values.
Distribution of gene-by-gene expression cutoff values
Histogram depicting the distribution of expression cutoff values used for each gene. In order to be considered 'expressed above background', all features (genes, transcripts, exons, junctions, etc.) must be expressed above the level of INTERGENIC noise. This INTERGENIC cutoff is the 95th percentile of expression values for all Silent Intergenic Regions in this library and is depicted below as a dotted red line. The number of genes for which only INTERGENIC noise is considered is provided in the legend. For features within the boundaries of highly expressed genes, additional noise is expected due to contamination with unprocessed RNA. For this reason, a higher INTRAGENIC cutoff is determined for such genes. It is calculated by fitting a linear model to the 95th percentile of expression values for silent intronic regions. The INTRAGENIC cutoff for a gene is then determined from the gene expression level and the coefficients of the model fit. The distribution of the resulting gene-by-gene cutoffs is depicted below as a histogram. The number of genes that required an INTRAGENIC cutoff is also indicated in the legend.
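One way to read the procedure above is sketched here. It rests on several assumptions that the report does not confirm: the model is an ordinary least-squares line relating gene expression to the 95th percentile of silent intron expression, and the INTRAGENIC cutoff replaces the INTERGENIC cutoff only when it is higher. All names and data points are hypothetical; only the intergenic value 2.98 (log2) comes from this report.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def gene_cutoff(gene_expression, slope, intercept, intergenic_cutoff):
    """Assumed rule: use the intragenic cutoff only when it exceeds the intergenic one."""
    intragenic = intercept + slope * gene_expression
    return max(intergenic_cutoff, intragenic)

# Hypothetical points: gene expression vs. 95th percentile of silent intron expression.
gene_expr = [0, 2, 4, 6]
intron_p95 = [1, 2, 3, 4]
slope, intercept = fit_line(gene_expr, intron_p95)

INTERGENIC_CUTOFF = 2.98  # log2 value reported for library HS1810
print(gene_cutoff(10, slope, intercept, INTERGENIC_CUTOFF))
print(gene_cutoff(2, slope, intercept, INTERGENIC_CUTOFF))
```

Under this reading, weakly expressed genes keep the library-wide intergenic cutoff, while highly expressed genes receive a stricter, gene-specific threshold.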
Histograms of expression values for each feature type
Histograms depicting the distribution of log2 expression values for individual feature types. The 95th percentile of expression values for silent intergenic regions is depicted as a dotted line on all plots.
Percentiles plot for expression of exon regions, silent intron regions and silent intergenic regions
Percentiles plot for exon region, silent intron region and silent intergenic region expression values. The 95th percentiles of the intronic and intergenic distributions are depicted as colored, dotted lines.
Cumulative distribution of mean Phred score for all reads
CDF plot depicting the cumulative distribution of mean Phred score for all reads, broken down by R1/R2 and sequencing lane.