A technology-agnostic long-read analysis pipeline for transcriptome discovery and quantification
Dana Wyman, Gabriela Balderrama-Gutierrez, Fairlie Reese et al.|bioRxiv (Cold Spring Harbor Laboratory)|2019
Alternative splicing is widely acknowledged to be a crucial regulator of gene expression and is a key contributor to both normal developmental processes and disease states. While cost-effective and accurate for quantification, short-read RNA-seq lacks the ability to resolve full-length transcript isoforms despite increasingly sophisticated computational methods. Long-read sequencing platforms such as Pacific Biosciences (PacBio) and Oxford Nanopore (ONT) bypass the transcript reconstruction challenges of short reads. Here we introduce TALON, the ENCODE4 pipeline for platform-independent analysis of long-read transcriptomes. We apply TALON to the GM12878 cell line and show that while both PacBio and ONT technologies perform well at full-transcript discovery and quantification, each displays distinct technical artifacts. We further apply TALON to mouse hippocampus and cortex transcriptomes and find that 422 genes found in these regions have more reads associated with novel isoforms than with annotated ones. We demonstrate that TALON is capable of tracking both known and novel transcript models, as well as their expression levels, across datasets in both simple studies and larger projects. These properties will enable TALON users to move beyond the limitations of short-read data to perform isoform discovery and quantification in a uniform manner on existing and future long-read platforms.
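The comparison behind the "422 genes" figure, reads assigned to novel isoforms versus reads assigned to annotated isoforms per gene, can be illustrated with a minimal sketch. This is not TALON's code or output format: the function name, the `(gene_id, is_novel, n_reads)` record layout, and the toy counts are all hypothetical, chosen only to show the tally.

```python
from collections import defaultdict

def genes_dominated_by_novel(read_counts):
    """Return genes with more reads on novel than on annotated isoforms.

    read_counts: iterable of (gene_id, is_novel, n_reads) tuples, one per
    transcript model. This layout is a hypothetical stand-in for an
    abundance table, not a real TALON file format.
    """
    novel = defaultdict(int)
    known = defaultdict(int)
    for gene, is_novel, n_reads in read_counts:
        if is_novel:
            novel[gene] += n_reads
        else:
            known[gene] += n_reads
    # A gene with no annotated reads has known[gene] == 0 by default.
    return sorted(gene for gene in novel if novel[gene] > known[gene])

# Toy data (made up for illustration only).
toy_counts = [
    ("geneA", False, 10),
    ("geneA", True, 25),
    ("geneB", False, 40),
    ("geneB", True, 5),
    ("geneC", True, 3),
]
print(genes_dominated_by_novel(toy_counts))
```

On the toy data the function reports `geneA` and `geneC`, the two genes whose novel-isoform reads outnumber their annotated-isoform reads.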
Dynamic Gene Regulatory Networks of Human Myeloid Differentiation
The ENCODE Uniform Analysis Pipelines
Benjamin C. Hitz, Jin-Wook Lee, Otto Jolanki et al.|bioRxiv (Cold Spring Harbor Laboratory)|2023
The Encyclopedia of DNA Elements (ENCODE) project is a collaborative effort to create a comprehensive catalog of functional elements in the human genome. The current database comprises more than 19,000 functional genomics experiments across more than 1,000 cell lines and tissues, using a wide array of experimental techniques to study the chromatin structure and the regulatory and transcriptional landscape of the Homo sapiens and Mus musculus genomes. All experimental data, metadata, and associated computational analyses created by the ENCODE consortium are submitted to the Data Coordination Center (DCC) for validation, tracking, storage, and distribution to community resources and the scientific community. The ENCODE project has engineered and distributed uniform processing pipelines in order to promote data provenance and reproducibility, as well as to allow interoperability between genomic resources and other consortia. All data files, reference genome versions, software versions, and parameters used by the pipelines are captured and available via the ENCODE Portal. The pipeline code, developed using Docker and Workflow Description Language (WDL; https://openwdl.org/), is publicly available on GitHub, with images available on Dockerhub (https://hub.docker.com), enabling access for a diverse range of biomedical researchers. ENCODE pipelines maintained and used by the DCC can be installed to run on personal computers, local HPC clusters, or in cloud computing environments via Cromwell. Access to the pipelines and data via the cloud gives small labs the ability to use the data or software without access to institutional compute clusters.
Standardization of the computational methodologies for analysis and quality control leads to comparable results from different ENCODE collections - a prerequisite for successful integrative analyses. Database URL: https://www.encodeproject.org/
TranscriptClean: variant-aware correction of indels, mismatches and splice junctions in long-read transcripts
Dana Wyman, A Mortazavi|Bioinformatics|2018
Motivation: Long-read, single-molecule sequencing platforms hold great potential for isoform discovery and characterization of multi-exon transcripts. However, their high error rates are an obstacle to distinguishing novel transcript isoforms from sequencing artifacts. Therefore, we developed the package TranscriptClean to correct mismatches, microindels and noncanonical splice junctions in mapped transcripts using the reference genome while preserving known variants. Results: Our method corrects nearly all mismatches and indels present in a publicly available human PacBio Iso-seq dataset, and rescues 39% of noncanonical splice junctions. Availability and implementation: All Python and R scripts used in this paper are available at https://github.com/dewyman/TranscriptClean.
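To make "noncanonical splice junction" concrete, the sketch below classifies an intron by its terminal dinucleotides. It is not taken from TranscriptClean. It assumes the common convention that GT-AG, GC-AG and AT-AC introns count as canonical, and the helper name, coordinate convention, and toy sequence are hypothetical.

```python
# Hypothetical helper for illustration; not part of TranscriptClean.
# Assumption: GT-AG, GC-AG and AT-AC are treated as canonical motifs.
CANONICAL_MOTIFS = {("GT", "AG"), ("GC", "AG"), ("AT", "AC")}

def is_canonical_intron(genome: str, intron_start: int, intron_end: int) -> bool:
    """Check the donor/acceptor dinucleotides of a forward-strand intron.

    intron_start is 0-based and inclusive, intron_end is exclusive.
    """
    intron = genome[intron_start:intron_end].upper()
    if len(intron) < 4:
        return False  # too short to carry both dinucleotides
    return (intron[:2], intron[-2:]) in CANONICAL_MOTIFS

# Toy sequence: exon "AAACC", intron "GTAAGTTTTCAG", exon "GGTTT".
toy_genome = "AAACC" + "GTAAGTTTTCAG" + "GGTTT"
print(is_canonical_intron(toy_genome, 5, 17))  # True: GT...AG
print(is_canonical_intron(toy_genome, 6, 17))  # False: boundary off by one base
```

The second call shows how a one-base alignment error at the donor site turns a canonical junction into a noncanonical one, which is the kind of artifact a reference-based correction step aims to repair.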
SARS-CoV-2 variant Delta rapidly displaced variant Alpha in the United States and led to higher viral loads
Alexandre Bolze, Shishi Luo, Simon White et al.|Cell Reports Medicine|2022
We report on the sequencing of 74,348 SARS-CoV-2 positive samples collected across the United States and show that the Delta variant, first detected in the United States in March 2021, made up the majority of SARS-CoV-2 infections by July 1, 2021 and accounted for >99.9% of the infections by September 2021. Not only did Delta displace variant Alpha, which was the dominant variant at the time, it also displaced the Gamma, Iota, and Mu variants. Through an analysis of quantification cycle (Cq) values, we demonstrate that Delta infections tend to have a 1.7× higher viral load compared to Alpha infections (a decrease of 0.8 Cq) on average. Our results are consistent with the hypothesis that the increased transmissibility of the Delta variant could be due to the ability of the Delta variant to establish a higher viral load earlier in the infection as compared to the Alpha variant.
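The link between a 0.8 Cq decrease and a roughly 1.7× higher viral load follows from the exponential nature of PCR. Assuming ideal amplification, where template doubles every cycle, a sample that crosses the threshold d cycles earlier started with about 2^d times more template. The sketch below works through that arithmetic; the function name and the `efficiency` parameter are illustrative assumptions, not something described in the abstract.

```python
def fold_change_from_delta_cq(delta_cq: float, efficiency: float = 1.0) -> float:
    """Estimate the fold change in starting template for a Cq decrease.

    delta_cq: how many cycles earlier the sample crosses the threshold.
    efficiency: per-cycle amplification efficiency; 1.0 means perfect
    doubling, which is an idealizing assumption.
    """
    return (1.0 + efficiency) ** delta_cq

fold = fold_change_from_delta_cq(0.8)
print(f"{fold:.2f}x")  # 2 ** 0.8, about 1.74x
```

Under the perfect-doubling assumption, 2^0.8 ≈ 1.74, which matches the reported 1.7× figure after rounding.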