Characterization and identification of long non-coding RNAs based on feature relationship

Guangyu Wang (Chinese Academy of Sciences), Hongyan Yin (Chinese Academy of Sciences), Boyang Li (Yale University), Chunlei Yu (Chinese Academy of Sciences), Fan Wang (Chinese Academy of Sciences), Xingjian Xu (Chinese Academy of Sciences), Jiabao Cao (Chinese Academy of Sciences), Yīmíng Bào (Chinese Academy of Sciences), Liguo Wang (Mayo Clinic), Amir Ali Abbasi (Quaid-i-Azam University), Vladimir B. Bajić (King Abdullah University of Science and Technology), Lina Ma (Chinese Academy of Sciences), Zhang Zhang (Chinese Academy of Sciences)
Bioinformatics
January 8, 2019
Cited by 157 | Open Access

Abstract

MOTIVATION: The significance of long non-coding RNAs (lncRNAs) in many biological processes and diseases has attracted intense interest over the past several years. However, computational identification of lncRNAs across a wide range of species remains challenging: existing approaches require prior knowledge of well-established sequences and annotations, or species-specific training data, yet only a limited number of species have high-quality sequences and annotations.

RESULTS: Here we first characterize lncRNAs in contrast to protein-coding RNAs based on feature relationship and find that the relationship between open reading frame (ORF) length and guanine-cytosine (GC) content diverges substantially between lncRNAs and protein-coding RNAs, as observed in a broad variety of species. Based on this feature relationship, we present LGC, a novel algorithm that accurately distinguishes lncRNAs from protein-coding RNAs in a cross-species manner without any prior knowledge. Validated on large-scale empirical datasets, LGC outperforms existing algorithms, achieving higher accuracy with well-balanced sensitivity and specificity, and is robustly effective (>90% accuracy) in discriminating lncRNAs from protein-coding RNAs across diverse species ranging from plants to mammals. To our knowledge, this study is the first to differentially characterize lncRNAs and protein-coding RNAs based on feature relationship and to apply that characterization to the computational identification of lncRNAs. Taken together, our study represents a significant advance in the characterization and identification of lncRNAs, and LGC bears broad potential utility for computational analysis of lncRNAs in a wide range of species.

AVAILABILITY AND IMPLEMENTATION: The LGC web server is publicly available at http://bigd.big.ac.cn/lgc/calculator. The scripts and data can be downloaded at http://bigd.big.ac.cn/biocode/tools/BT000004.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
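
To make the two features named in the abstract concrete, the following minimal Python sketch extracts the longest open reading frame of a transcript and reports its length and GC content. It is not the LGC algorithm, whose scoring model the abstract does not describe. The function names (`gc_content`, `longest_orf`, `orf_features`) are hypothetical, and the choices to scan only the three forward frames, to require an ATG start codon, and to measure GC content on the ORF rather than on the whole transcript are assumptions made for illustration.

```python
# Illustrative sketch only: computes the two features mentioned in the abstract
# (ORF length and GC content). This is NOT the published LGC scoring method.

STOP_CODONS = {"TAA", "TAG", "TGA"}


def gc_content(seq: str) -> float:
    """Return the fraction of G and C bases in seq (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def longest_orf(seq: str) -> str:
    """Return the longest ATG-to-stop ORF found in the three forward frames.

    Assumption: only the given strand is scanned, and an ORF must end with
    an in-frame stop codon. Returns "" when no such ORF exists.
    """
    seq = seq.upper().replace("U", "T")  # accept RNA or DNA alphabets
    best = ""
    for frame in range(3):
        start = None  # index of the first ATG of the ORF currently open
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                orf = seq[start:i + 3]  # include the stop codon
                if len(orf) > len(best):
                    best = orf
                start = None
    return best


def orf_features(seq: str) -> tuple[int, float]:
    """Return (ORF length in nucleotides, GC content of that ORF)."""
    orf = longest_orf(seq)
    return len(orf), gc_content(orf)


if __name__ == "__main__":
    length, gc = orf_features("CCATGGCGTGACC")
    print(f"ORF length: {length} nt, ORF GC content: {gc:.3f}")
```

Computing these two values for sets of annotated lncRNAs and protein-coding transcripts, then plotting one against the other, is one simple way to inspect the kind of divergence the abstract reports.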

