Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation

Nuala A. O’Leary(National Institutes of Health), Matt W. Wright(National Institutes of Health), J. Rodney Brister(National Institutes of Health), Stacy Ciufo(National Institutes of Health), Diana Haddad(National Institutes of Health), Rich McVeigh(National Institutes of Health), Bhanu Rajput(National Institutes of Health), Barbara Robbertse(National Institutes of Health), Brian Smith-White(National Institutes of Health), Danso Ako-adjei(National Institutes of Health), Alexander Astashyn(National Institutes of Health), Azat Badretdin(National Institutes of Health), Yīmíng Bào(National Institutes of Health), Olga Blinkova(National Institutes of Health), Vyacheslav Brover(National Institutes of Health), Vyacheslav Chetvernin(National Institutes of Health), Jinna Choi(National Institutes of Health), Eric Cox(National Institutes of Health), Olga Ermolaeva(National Institutes of Health), Catherine M. Farrell(National Institutes of Health), Tamara Goldfarb(National Institutes of Health), Tripti Gupta(National Institutes of Health), Daniel H. Haft(National Institutes of Health), Eneida Hatcher(National Institutes of Health), Wratko Hlavina(National Institutes of Health), Vinita Joardar(National Institutes of Health), Vamsi K. Kodali(National Institutes of Health), Wenjun Li(National Institutes of Health), Donna Maglott(National Institutes of Health), Patrick Masterson(National Institutes of Health), Kelly M. McGarvey(National Institutes of Health), Michael R. Murphy(National Institutes of Health), Kathleen O’Neill(National Institutes of Health), Shashikant Pujar(National Institutes of Health), Sanjida H Rangwala(National Institutes of Health), Daniel Rausch(National Institutes of Health), Lillian D. Riddick(National Institutes of Health), Conrad L. Schoch(National Institutes of Health), Andrei Shkeda(National Institutes of Health), Susan S. Storz(National Institutes of Health), Hanzhen Sun(National Institutes of Health), Françoise Thibaud‐Nissen(National Institutes of Health), Igor Tolstoy(National Institutes of Health), Raymond E. Tully(National Institutes of Health), Anjana R. Vatsan(National Institutes of Health), Craig Wallin(National Institutes of Health), David Webb(National Institutes of Health), Wendy Wu(National Institutes of Health), Melissa Landrum(National Institutes of Health), Avi Kimchi(National Institutes of Health), Tatiana Tatusova(National Institutes of Health), Michael DiCuccio(National Institutes of Health), Paul Kitts(National Institutes of Health), Terence D. Murphy(National Institutes of Health), Kim D. Pruitt(National Institutes of Health)
Nucleic Acids Research
November 8, 2015
Cited by 7,035
Open Access

Abstract

The RefSeq project at the National Center for Biotechnology Information (NCBI) maintains and curates a publicly available database of annotated genomic, transcript, and protein sequence records (http://www.ncbi.nlm.nih.gov/refseq/). The RefSeq project applies a combination of computation, manual curation, and collaboration to the data submitted to the International Nucleotide Sequence Database Collaboration (INSDC) to produce a standard set of stable, non-redundant reference sequences. The RefSeq project augments these reference sequences with current knowledge, including publications, functional features, and informative nomenclature. The database currently represents sequences from more than 55,000 organisms (>4800 viruses, >40,000 prokaryotes and >10,000 eukaryotes; RefSeq release 71), with per-organism coverage ranging from a single record to complete genomes. This paper summarizes the current status of the viral, prokaryotic, and eukaryotic branches of the RefSeq project, reports on improvements to data access, and details efforts to further expand the taxonomic representation of the collection. We also highlight diverse functional curation initiatives that support multiple uses of RefSeq data, including taxonomic validation, genome annotation, comparative genomics, and clinical testing. We summarize our approach to using available RNA-Seq and other data types in our manual curation process for vertebrate, plant, and other species, and describe a new direction for prokaryotic genomes and protein name management.

