A technology-agnostic long-read analysis pipeline for transcriptome discovery and quantification

Dana Wyman; Gabriela Balderrama-Gutierrez; Fairlie Reese; Shan Jiang; Sorena Rahmanian; Stefânia Forner; Dina P. Matheos; Weihua Zeng; Brian A. Williams; Diane Trout; Whitney England; Shu‐Hui Chu; Robert C. Spitale; Andrea J. Tenner; B Wold; A Mortazavi

doi:10.1101/672931

Dana Wyman (University of California, Irvine), Gabriela Balderrama-Gutierrez (University of California, Irvine), Fairlie Reese (University of California, Irvine), Shan Jiang (University of California, Irvine), Sorena Rahmanian (University of California, Irvine), Stefânia Forner (University of California, Irvine), Dina P. Matheos (University of California, Irvine), Weihua Zeng (University of California, Irvine), Brian A. Williams (California Institute of Technology), Diane Trout (California Institute of Technology), Whitney England (University of California, Irvine), Shu‐Hui Chu (University of California, Irvine), Robert C. Spitale (University of California, Irvine), Andrea J. Tenner (University of California, Irvine), B Wold (California Institute of Technology), A Mortazavi (University of California, Irvine)

bioRxiv (Cold Spring Harbor Laboratory)

June 18, 2019

Cited by 192 | Open Access

Abstract

Alternative splicing is widely acknowledged to be a crucial regulator of gene expression and a key contributor to both normal developmental processes and disease states. While cost-effective and accurate for quantification, short-read RNA-seq cannot resolve full-length transcript isoforms, despite increasingly sophisticated computational methods. Long-read sequencing platforms such as Pacific Biosciences (PacBio) and Oxford Nanopore (ONT) bypass the transcript reconstruction challenges of short reads. Here we introduce TALON, the ENCODE4 pipeline for platform-independent analysis of long-read transcriptomes. We apply TALON to the GM12878 cell line and show that while both PacBio and ONT technologies perform well at full-transcript discovery and quantification, each displays distinct technical artifacts. We further apply TALON to mouse hippocampus and cortex transcriptomes and identify 422 genes in these regions that have more reads associated with novel isoforms than with annotated ones. We demonstrate that TALON is capable of tracking both known and novel transcript models, as well as their expression levels, across datasets in both simple studies and larger projects. These properties will enable TALON users to move beyond the limitations of short-read data and perform isoform discovery and quantification in a uniform manner on existing and future long-read platforms.
