GENCODE: reference annotation for the human and mouse genomes in 2023
GENCODE produces high-quality gene and transcript annotation for the human and mouse genomes. All GENCODE annotation is supported by experimental data and serves as a reference for genome biology and clinical genomics. The GENCODE consortium generates targeted experimental data, develops bioinformatic tools and carries out analyses that, along with externally produced data and methods, support the identification and annotation of transcript structures and the determination of their function. Here, we present an update on the annotation of human and mouse genes, including developments in the tools, data, analyses and major collaborations which underpin this progress. For example, we report the creation of a set of non-canonical ORFs identified in GENCODE transcripts, the LRGASP collaboration to assess the use of long-read transcriptomic data to build transcript models, the progress in collaborations with RefSeq and UniProt to increase convergence in the annotation of human and mouse protein-coding genes, the propagation of GENCODE across the human pan-genome and the development of new tools to support annotation of regulatory features by GENCODE. Our annotation is accessible via Ensembl, the UCSC Genome Browser and https://www.gencodegenes.org.
GENCODE 2025: reference gene annotation for human and mouse
GENCODE produces comprehensive reference gene annotation for human and mouse. Entering its twentieth year, the project remains highly active as new technologies and methodologies allow us to catalog the genome at ever-increasing granularity. In particular, long-read transcriptome sequencing enables us to identify large numbers of missing transcripts and to substantially improve existing models, and our long non-coding RNA catalogs have undergone a dramatic expansion and reconfiguration as a result. Meanwhile, we are incorporating data from state-of-the-art proteomics and Ribo-seq experiments to fine-tune our annotation of translated sequences, while further insights into function can be gained from multi-genome alignments that grow richer as more species' genomes are sequenced. Such methodologies are combined into a fully integrated annotation workflow. However, the increasing complexity of our resources can present usability challenges, and we are resolving these with the creation of filtered genesets such as MANE Select and GENCODE Primary. The next challenge is to propagate annotations throughout multiple human and mouse genomes, as we enter the pangenome era. Our resources are freely available at our web portal www.gencodegenes.org, and via the Ensembl and UCSC genome browsers.
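As a minimal sketch of how a filtered geneset such as MANE Select might be pulled out of an annotation file, the Python below scans GTF-formatted lines and keeps the transcript records whose attribute column carries a given tag. It assumes the tag is written as `tag "MANE_Select"` in the attribute column, as in GENCODE GTF releases. The function name `transcripts_with_tag`, the toy records and the IDs `G1`, `T1` and `T2` are hypothetical and are not real GENCODE data.

```python
import re

# Matches every `tag "..."` entry in a GTF attribute column.
TAG_PATTERN = re.compile(r'tag "([^"]+)"')
# Matches the transcript identifier in a GTF attribute column.
TRANSCRIPT_ID_PATTERN = re.compile(r'transcript_id "([^"]+)"')


def transcripts_with_tag(gtf_lines, wanted_tag="MANE_Select"):
    """Yield transcript IDs of 'transcript' records that carry wanted_tag.

    gtf_lines is any iterable of GTF text lines (a list or an open file).
    """
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header and comment lines
        fields = line.rstrip("\n").split("\t")
        # A GTF record has nine tab-separated columns; column 3 is the feature type.
        if len(fields) != 9 or fields[2] != "transcript":
            continue
        attributes = fields[8]
        if wanted_tag in TAG_PATTERN.findall(attributes):
            match = TRANSCRIPT_ID_PATTERN.search(attributes)
            if match:
                yield match.group(1)


# Toy records (hypothetical, for illustration only).
EXAMPLE_GTF = [
    "##description: toy example, not real GENCODE data\n",
    'chr1\tHAVANA\ttranscript\t100\t900\t.\t+\t.\t'
    'gene_id "G1"; transcript_id "T1"; tag "basic"; tag "MANE_Select";\n',
    'chr1\tHAVANA\ttranscript\t100\t700\t.\t+\t.\t'
    'gene_id "G1"; transcript_id "T2"; tag "basic";\n',
    'chr1\tHAVANA\texon\t100\t300\t.\t+\t.\t'
    'gene_id "G1"; transcript_id "T1"; tag "MANE_Select";\n',
]

selected = list(transcripts_with_tag(EXAMPLE_GTF))
print(selected)
```

Because the function only iterates over lines, the same call could in principle be given an open file handle for a downloaded GTF instead of the toy list. Passing a different `wanted_tag` would select a different filtered set, provided that tag is actually present in the file being read.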
APPRIS: selecting functionally important isoforms
APPRIS (https://appris.bioinfo.cnio.es) is a well-established database housing annotations for protein isoforms for a range of species. APPRIS selects principal isoforms based on protein structure and function features and on cross-species conservation. Most coding genes produce a single main protein isoform and the principal isoforms chosen by the APPRIS database best represent this main cellular isoform. Human genetic data, experimental protein evidence and the distribution of clinical variants all support the relevance of APPRIS principal isoforms. APPRIS annotations and principal isoforms have now been expanded to 10 model organisms. In this paper we highlight the most recent updates to the database. APPRIS annotations have been generated for two new species, cow and chicken, the protein structural information has been augmented with reliable models from the EMBL-EBI AlphaFold database, and we have substantially expanded the confirmatory proteomics evidence available for the human genome. The most significant change in APPRIS has been the implementation of TRIFID functional isoform scores. TRIFID functional scores are assigned to all splice isoforms, and APPRIS uses the TRIFID functional scores and proteomics evidence to determine principal isoforms when core methods cannot.
The GENCODE CLS project: massively expanding the lncRNA catalog through capture long-read RNA sequencing
Tamara Perteghella, Gazaldeep Kaur, Sílvia Carbonell Sala et al. | bioRxiv (Cold Spring Harbor Laboratory) | 2024
Accurate and complete gene annotations are indispensable for understanding how genome sequences encode biological functions. For twenty years, the GENCODE consortium has developed reference annotations for the human and mouse genomes, becoming a foundation for biomedical and genomics communities worldwide. Nevertheless, collections of important yet poorly understood gene classes like long non-coding RNAs (lncRNAs) remain incomplete and scattered across multiple, uncoordinated catalogs, slowing down progress in the field. To address these issues, GENCODE has undertaken the most comprehensive lncRNA annotation effort to date. This is founded on the manual annotation of full-length targeted long-read sequencing, on matched embryonic and adult tissues, of orthologous regions in human and mouse. Altogether, 17,931 novel human genes (140,268 novel transcripts) and 22,784 novel mouse genes (136,169 novel transcripts) have been added to the GENCODE catalog, representing a 2-fold and 6-fold increase in transcripts, respectively, which is the greatest increase since the sequencing of the human genome. Novel gene annotations display evolutionary constraints, have well-formed promoter regions, and link to phenotype-associated genetic variants. They greatly enhance the functional interpretability of the human genome, as they help explain millions of previously mapped "orphan" omics measurements corresponding to transcription start sites, chromatin modifications and transcription factor binding sites. Crucially, our targeted design assigned human-mouse orthologs at a rate beyond previous studies, tripling the number of human disease-associated lncRNAs with mouse orthologs. The expanded and enhanced GENCODE lncRNA annotations mark a critical step towards deciphering the human and mouse genomes.
Origins and Evolution of Human Tandem Duplicated Exon Substitution Events
The mutually exclusive splicing of tandem duplicated exons produces protein isoforms that are identical save for a homologous region that allows for the fine tuning of protein function. Tandem duplicated exon substitution events are rare, yet highly important alternative splicing events. Most events are ancient, their isoforms are highly expressed, and they have significantly more pathogenic mutations than other splice events. Here, we analyzed the physicochemical properties and functional roles of the homologous polypeptide regions produced by the 236 tandem duplicated exon substitutions annotated in the human gene set. We find that the most important structural and functional residues in these homologous regions are maintained, and that most changes are conservative rather than drastic. Three quarters of the isoforms produced from tandem duplicated exon substitution events are tissue-specific, particularly in nervous and cardiac tissues, and tandem duplicated exon substitution events are enriched in functional terms related to structures in the brain and skeletal muscle. We find considerable evidence for the convergent evolution of tandem duplicated exon substitution events in vertebrates, arthropods, and nematodes. Twelve human gene families have orthologues with tandem duplicated exon substitution events in both Drosophila melanogaster and Caenorhabditis elegans. Six of these gene families are ion transporters, suggesting that tandem exon duplication in genes that control the flow of ions into the cell has an adaptive benefit. The ancient origins, the strong indications of tissue-specific functions, and the evidence of convergent evolution suggest that these events may have played important roles in the evolution of animal tissues and organs.