Utilizing sequence intrinsic composition to classify protein-coding and long non-coding transcripts

Liang Sun (Chinese Academy of Sciences), Haitao Luo (Chinese Academy of Sciences), Dechao Bu (Chinese Academy of Sciences), Guoguang Zhao (Chinese Academy of Sciences), Kuntao Yu (Chinese Academy of Sciences), Changhai Zhang (Chinese Academy of Sciences), Yuanning Liu (Chinese Academy of Sciences), Runsheng Chen (Chinese Academy of Sciences), Yi Zhao (Chinese Academy of Sciences)
Nucleic Acids Research
July 27, 2013
Cited by 2,333. Open Access.

Abstract

Classifying transcripts as protein-coding or non-coding is a challenge, especially for transcripts reconstructed from high-throughput sequencing data of poorly annotated species. This study developed and evaluated a signature-based tool, the Coding-Non-Coding Index (CNCI), which profiles adjoining nucleotide triplets to distinguish protein-coding from non-coding sequences independently of known annotations. CNCI is effective for classifying incomplete transcripts and sense-antisense pairs. Applied across species, CNCI gave highly accurate classification of transcripts assembled from whole-transcriptome sequencing data, demonstrated gene evolutionary divergence between vertebrates and invertebrates, or between plants, and provided a long non-coding RNA catalog of orangutan. CNCI software is available at http://www.bioinfo.org/software/cnci.
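To make the idea of "profiling adjoining nucleotide triplets" concrete, the following is a minimal Python sketch that counts pairs of consecutive, non-overlapping triplets in one reading frame of a sequence. It is not the CNCI implementation: the function names `adjoining_triplet_counts` and `adjoining_triplet_frequencies`, the `frame` parameter, and the handling of ambiguous bases are assumptions made purely for illustration, and the abstract does not say how CNCI turns such a profile into a classification.

```python
from collections import Counter


def adjoining_triplet_counts(seq: str, frame: int = 0) -> Counter:
    """Count pairs of adjoining triplets in one reading frame.

    Hypothetical helper for illustration only, not part of CNCI.
    """
    # Normalise case and treat RNA input (U) as DNA (T).
    seq = seq.upper().replace("U", "T")
    # Split into consecutive, non-overlapping triplets starting at `frame`.
    triplets = [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]
    valid = set("ACGT")
    pairs = Counter()
    for first, second in zip(triplets, triplets[1:]):
        # Assumption: skip pairs containing ambiguous bases such as N.
        if set(first + second) <= valid:
            pairs[(first, second)] += 1
    return pairs


def adjoining_triplet_frequencies(seq: str, frame: int = 0) -> dict:
    """Convert the pair counts into relative frequencies."""
    counts = adjoining_triplet_counts(seq, frame)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {pair: n / total for pair, n in counts.items()}


if __name__ == "__main__":
    example = "ATGGCCATGGCC"
    for pair, n in adjoining_triplet_counts(example).items():
        print(pair, n)
```

A profile like this uses only the sequence itself, which is in line with the abstract's claim of independence from known annotations. How frames are compared and scored is left open here.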

