The ENCODE4 long-read RNA-seq collection reveals distinct classes of transcript structure diversity

Fairlie Reese; Brian A. Williams; Gabriela Balderrama-Gutierrez; Dana Wyman; Muhammed Hasan Çelik; Elisabeth Rebboah; Narges Rezaie; Diane Trout; Milad Razavi-Mohseni; Yunzhe Jiang; Beatrice Borsari; Samuel Morabito; Heidi Yahan Liang; Cassandra McGill; Sorena Rahmanian; Jasmine Sakr; Shan Jiang; Weihua Zeng; Klébea Carvalho; Annika K. Weimer; Louise A. Dionne; Ariel McShane; Karan Bedi; Shaimae I. Elhajjajy; Sean Upchurch; Jennifer Jou; Ingrid Youngworth; Idan Gabdank; Paul Sud; Otto Jolanki; J. Seth Strattan; Meenakshi S. Kagda; M Snyder; Ben C. Hitz; Jill E. Moore; Zhiping Weng; David A. Bennett; Laura G. Reinholdt; Mats Ljungman; M Beer; Mark Gerstein; Lior Pachter; Roderic Guigó; B Wold; A Mortazavi

doi:10.1101/2023.05.15.540865

Fairlie Reese (University of California, Irvine), Brian A. Williams (California Institute of Technology), Gabriela Balderrama-Gutierrez (University of California, Irvine), Dana Wyman (University of California, Irvine), Muhammed Hasan Çelik (University of California, Irvine), Elisabeth Rebboah (University of California, Irvine), Narges Rezaie (University of California, Irvine), Diane Trout (California Institute of Technology), Milad Razavi-Mohseni (Johns Hopkins University), Yunzhe Jiang (Yale University), Beatrice Borsari (Yale University), Samuel Morabito (University of California, Irvine), Heidi Yahan Liang (University of California, Irvine), Cassandra McGill (University of California, Irvine), Sorena Rahmanian (University of California, Irvine), Jasmine Sakr (University of California, Irvine), Shan Jiang (University of California, Irvine), Weihua Zeng (University of California, Irvine), Klébea Carvalho (University of California, Irvine), Annika K. Weimer (Stanford University), Louise A. Dionne (Jackson Laboratory), Ariel McShane (University of Michigan), Karan Bedi (University of Michigan), Shaimae I. Elhajjajy (University of Massachusetts Chan Medical School), Sean Upchurch (California Institute of Technology), Jennifer Jou (Stanford University), Ingrid Youngworth (Stanford University), Idan Gabdank (Stanford University), Paul Sud (Stanford University), Otto Jolanki (Stanford University), J. Seth Strattan (Stanford University), Meenakshi S. Kagda (Stanford University), M Snyder (Stanford University), Ben C. Hitz (Stanford University), Jill E. Moore (University of Massachusetts Chan Medical School), Zhiping Weng (University of Massachusetts Chan Medical School), David A. Bennett (Rush University Medical Center), Laura G. Reinholdt (Jackson Laboratory), Mats Ljungman (University of Michigan), M Beer (Johns Hopkins University), Mark Gerstein (Yale University), Lior Pachter (California Institute of Technology), Roderic Guigó (Universitat Pompeu Fabra), B Wold (California Institute of Technology), A Mortazavi (University of California, Irvine)

bioRxiv (Cold Spring Harbor Laboratory)

May 16, 2023

Cited by 63 | Open Access

Abstract

The majority of mammalian genes encode multiple transcript isoforms that result from differential promoter use, changes in exonic splicing, and alternative 3' end choice. Detecting and quantifying transcript isoforms across tissues, cell types, and species has been extremely challenging because transcripts are much longer than the short reads normally used for RNA-seq. By contrast, long-read RNA-seq (LR-RNA-seq) gives the complete structure of most transcripts. We sequenced 264 LR-RNA-seq PacBio libraries totaling over 1 billion circular consensus reads (CCS) for 81 unique human and mouse samples. We detect at least one full-length transcript from 87.7% of annotated human protein coding genes and a total of 200,000 full-length transcripts, 40% of which have novel exon junction chains. To capture and compute on the three sources of transcript structure diversity, we introduce a gene and transcript annotation framework that uses triplets representing the transcript start site, exon junction chain, and transcript end site of each transcript. Using triplets in a simplex representation demonstrates how promoter selection, splice pattern, and 3' processing are deployed across human tissues, with nearly half of multi-transcript protein coding genes showing a clear bias toward one of the three diversity mechanisms. Evaluated across samples, the predominantly expressed transcript changes for 74% of protein coding genes. In evolution, the human and mouse transcriptomes are globally similar in types of transcript structure diversity, yet among individual orthologous gene pairs, more than half (57.8%) show substantial differences in mechanism of diversification in matching tissues. This initial large-scale survey of human and mouse long-read transcriptomes provides a foundation for further analyses of alternative transcript usage, and is complemented by short-read and microRNA data on the same samples and by epigenome data elsewhere in the ENCODE4 collection.
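The triplet idea in the abstract can be illustrated with a minimal sketch. It assumes each transcript is reduced to a (transcript start site, exon junction chain, transcript end site) triplet of identifiers, and it places a gene on a simplex by normalizing the counts of distinct starts, junction chains, and ends so that they sum to 1. The names `Triplet`, `simplex_coords`, and `dominant_mechanism`, the plain count normalization, and the 0.5 threshold are all illustrative assumptions, not the authors' published definitions.

```python
from collections import namedtuple

# Hypothetical representation: one record per transcript of a gene.
Triplet = namedtuple("Triplet", ["tss", "ec", "tes"])


def simplex_coords(transcripts):
    """Return (tss, ec, tes) proportions that sum to 1 for one gene.

    Assumption: each coordinate is the share of distinct features of that
    type among all distinct features of the gene. The paper's own
    normalization may differ.
    """
    n_tss = len({t.tss for t in transcripts})
    n_ec = len({t.ec for t in transcripts})
    n_tes = len({t.tes for t in transcripts})
    total = n_tss + n_ec + n_tes
    if total == 0:
        raise ValueError("gene has no transcripts")
    return (n_tss / total, n_ec / total, n_tes / total)


def dominant_mechanism(coords, threshold=0.5):
    """Name the mechanism whose share exceeds the (assumed) threshold."""
    labels = ("tss", "ec", "tes")
    best = max(range(3), key=lambda i: coords[i])
    return labels[best] if coords[best] > threshold else "mixed"


# Toy gene: three transcripts that differ only in their start site.
toy_gene = [Triplet("tss1", "ec1", "tes1"),
            Triplet("tss2", "ec1", "tes1"),
            Triplet("tss3", "ec1", "tes1")]
coords = simplex_coords(toy_gene)
label = dominant_mechanism(coords)
```

In this toy case all diversity comes from alternative start sites, so the gene sits closest to the start-site corner of the simplex. A gene with similar numbers of distinct starts, junction chains, and ends would fall near the center and be labeled "mixed" under the assumed threshold.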
