The consensus coding sequence (CCDS) project: Identifying a common protein-coding gene set for the human and mouse genomes

Kim D. Pruitt; Jennifer Harrow; Rachel Harte; Craig Wallin; Mark Diekhans; Donna Maglott; Steve Searle; Catherine M. Farrell; Jane Loveland; Barbara J. Ruef; Elizabeth A. Hart; Marie‐Marthe Suner; Melissa Landrum; Bronwen Aken; Sarah Ayling; Robert Baertsch; Julio Fernandez-Banet; Joshua L. Cherry; Val Curwen; Michael DiCuccio; Manolis Kellis; Jennifer Lee; Michael F. Lin; Michael Schuster; Andrew Shkeda; Clara Amid; Garth Brown; Oksana I. Dukhanina; Adam Frankish; Jennifer Hart; B. Maidak; Jonathan M. Mudge; Michael R. Murphy; Terence D. Murphy; Jeena Rajan; Bhanu Rajput; Lillian D. Riddick; Catherine Snow; Charles A. Steward; David Webb; Janet A. Weber; Laurens Wilming; Wenyu Wu; Ewan Birney; David Haussler; Tim Hubbard; James Ostell; Richard Durbin; David J. Lipman

doi:10.1101/gr.080531.108


Kim D. Pruitt(National Center for Biotechnology Information), Jennifer Harrow(Wellcome Sanger Institute), Rachel Harte(University of California, Santa Cruz), Craig Wallin(National Center for Biotechnology Information), Mark Diekhans(University of California, Santa Cruz), Donna Maglott(National Center for Biotechnology Information), Steve Searle(Wellcome Sanger Institute), Catherine M. Farrell(National Center for Biotechnology Information), Jane Loveland(Wellcome Sanger Institute), Barbara J. Ruef(University of Oregon), Elizabeth A. Hart(Wellcome Sanger Institute), Marie‐Marthe Suner(Wellcome Sanger Institute), Melissa Landrum(National Center for Biotechnology Information), Bronwen Aken(Wellcome Sanger Institute), Sarah Ayling(University of Manchester), Robert Baertsch(University of California, Santa Cruz), Julio Fernandez-Banet(Wellcome Sanger Institute), Joshua L. Cherry(National Center for Biotechnology Information), Val Curwen(Wellcome Sanger Institute), Michael DiCuccio(National Center for Biotechnology Information), Manolis Kellis(Broad Institute), Jennifer Lee(National Center for Biotechnology Information), Michael F. Lin(Broad Institute), Michael Schuster(European Bioinformatics Institute), Andrew Shkeda(National Center for Biotechnology Information), Clara Amid(University of Oregon), Garth Brown(National Center for Biotechnology Information), Oksana I. Dukhanina(National Center for Biotechnology Information), Adam Frankish(Wellcome Sanger Institute), Jennifer Hart(National Center for Biotechnology Information), B. Maidak(National Center for Biotechnology Information), Jonathan M. Mudge(Wellcome Sanger Institute), Michael R. Murphy(National Center for Biotechnology Information), Terence D. Murphy(National Center for Biotechnology Information), Jeena Rajan(Wellcome Sanger Institute), Bhanu Rajput(National Center for Biotechnology Information), Lillian D. Riddick(National Center for Biotechnology Information), Catherine Snow(Wellcome Sanger Institute), Charles A. Steward(Wellcome Sanger Institute), David Webb(National Center for Biotechnology Information), Janet A. Weber(National Center for Biotechnology Information), Laurens Wilming(Wellcome Sanger Institute), Wenyu Wu(National Center for Biotechnology Information), Ewan Birney(European Bioinformatics Institute), David Haussler(University of California, Santa Cruz), Tim Hubbard(Wellcome Sanger Institute), James Ostell(National Center for Biotechnology Information), Richard Durbin(Wellcome Sanger Institute), David J. Lipman(National Center for Biotechnology Information)

Genome Research

June 4, 2009




Abstract

Effective use of the human and mouse genomes requires reliable identification of genes and their products. Although multiple public resources provide annotation, they use different methods, which can result in similar but not identical representations of genes, transcripts, and proteins. The collaborative consensus coding sequence (CCDS) project tracks identical protein annotations on the reference mouse and human genomes with a stable identifier (CCDS ID) and ensures that they are consistently represented on the NCBI, Ensembl, and UCSC Genome Browsers. Importantly, the project coordinates manual review of protein annotations that are inconsistent between sites, as well as annotations for which new evidence suggests a revision is needed, to progressively converge on a complete protein-coding set for the human and mouse reference genomes while maintaining a high standard of reliability and biological accuracy. To date, the project has identified 20,159 human and 17,707 mouse consensus coding regions from 17,052 human and 16,893 mouse genes. Three evaluation methods indicate that the entries in the CCDS set are highly likely to represent real proteins, more so than annotations from contributing groups that are not included in CCDS. The CCDS database thus centralizes the function of identifying well-supported, identically annotated protein-coding regions.
