The GENCODE CLS project: massively expanding the lncRNA catalog through capture long-read RNA sequencing

Tamara Perteghella (Universitat Pompeu Fabra), Gazaldeep Kaur (Universitat Pompeu Fabra), Sílvia Carbonell Sala (Centre for Genomic Regulation), José M. González (European Bioinformatics Institute), Toby Hunt (European Bioinformatics Institute), Tomasz Mądry (Institute of Bioorganic Chemistry, Polish Academy of Sciences), Irwin Jungreis (Broad Institute), Fabien Degalez (Centre for Genomic Regulation), Carme Arnan (SOM Biotech (Spain)), Ramil Nurtdinov (Yale University), Julien Lagarde (SOM Biotech (Spain)), Beatrice Borsari (Yale University), Cristina Sisu (European Bioinformatics Institute), Yunzhe Jiang (European Bioinformatics Institute), Ruth Bennett (European Bioinformatics Institute), Andrew Berry (European Bioinformatics Institute), Marta Blangiewicz (Hospital Del Mar), Daniel Cerdán-Vélez (European Bioinformatics Institute), Kelly Cochran (European Bioinformatics Institute), Covadonga Vara (Yale University), Claire Davidson (European Bioinformatics Institute), Sarah Donaldson (European Bioinformatics Institute), Cagatay Dursun (European Bioinformatics Institute), Silvia González-López (European Bioinformatics Institute), Sasti Gopal Das (European Bioinformatics Institute), Kathryn Lawrence (Hospital Del Mar), Daniel Nachun (Yale University), Matthew P. Hardy (European Bioinformatics Institute), Zoe Hollis (European Bioinformatics Institute), Mike Kay (University College Dublin), José Carlos Montañés (European Bioinformatics Institute), Pengyu Ni (Yale University), Emilio Palumbo (Yale University), Carlos Pulido-Quetglas (University College Dublin), Marie‐Marthe Suner (Institució Catalana de Recerca i Estudis Avançats), X. Yu (University of California, Santa Cruz), Dingyao Zhang (University of Vienna), François Aguet (Broad Institute), Kristin Ardlie (Broad Institute), Stephen B. Montgomery (European Bioinformatics Institute), Jane Loveland (European Bioinformatics Institute), M. Mar Albà (Broad Institute), Mark Diekhans (University of California, Santa Cruz), Andrea Tanzer (University of Vienna), Jonathan M. Mudge (European Bioinformatics Institute), Paul Flicek (University College Dublin), Fergal J. Martin (European Bioinformatics Institute), Mark Gerstein (European Bioinformatics Institute), M. Kellis (Broad Institute), Anshul Kundaje (Stanford University), Benedict Paten (University of California, Santa Cruz), Michael L. Tress (Spanish National Cancer Research Centre), Rory Johnson (University College Dublin), Barbara Uszczyńska-Ratajczak (Institute of Bioorganic Chemistry, Polish Academy of Sciences), Adam Frankish (European Bioinformatics Institute), Roderic Guigó (Universitat Pompeu Fabra)
bioRxiv (Cold Spring Harbor Laboratory)
October 31, 2024
Open Access

Abstract

Accurate and complete gene annotations are indispensable for understanding how genome sequences encode biological functions. For twenty years, the GENCODE consortium has developed reference annotations for the human and mouse genomes, which have become a foundation for the biomedical and genomics communities worldwide. Nevertheless, collections of important yet poorly understood gene classes such as long non-coding RNAs (lncRNAs) remain incomplete and scattered across multiple, uncoordinated catalogs, slowing progress in the field. To address these issues, GENCODE has undertaken the most comprehensive lncRNA annotation effort to date. It is founded on the manual annotation of full-length targeted long-read sequencing of orthologous regions in human and mouse, performed on matched embryonic and adult tissues. Altogether, 17,931 novel human genes (140,268 novel transcripts) and 22,784 novel mouse genes (136,169 novel transcripts) have been added to the GENCODE catalog, representing a 2-fold and 6-fold increase in transcripts, respectively, the greatest increase since the sequencing of the human genome. The novel gene annotations display evolutionary constraints, have well-formed promoter regions, and link to phenotype-associated genetic variants. They greatly enhance the functional interpretability of the human genome, as they help explain millions of previously mapped "orphan" omics measurements corresponding to transcription start sites, chromatin modifications and transcription factor binding sites. Crucially, our targeted design assigned human-mouse orthologs at a rate beyond that of previous studies, tripling the number of human disease-associated lncRNAs with mouse orthologs. The expanded and enhanced GENCODE lncRNA annotations mark a critical step towards deciphering the human and mouse genomes.

