CD-HIT: accelerated for clustering the next-generation sequencing data

LiMin Fu; Beifang Niu; Zhengwei Zhu; Sitao Wu; Weizhong Li

doi:10.1093/bioinformatics/bts565

LiMin Fu (University of California San Diego), Beifang Niu (University of California San Diego), Zhengwei Zhu (University of California San Diego), Sitao Wu (University of California San Diego), Weizhong Li (University of California San Diego)

Bioinformatics

October 11, 2012

Cited by 11,692. Open Access.

Abstract

SUMMARY: CD-HIT is a widely used program for clustering biological sequences to reduce sequence redundancy and improve the performance of other sequence analyses. In response to the rapid increase in the amount of sequencing data produced by next-generation sequencing technologies, we have developed a new CD-HIT program, accelerated with a novel parallelization strategy and several other techniques, to allow efficient clustering of such datasets. Our tests demonstrated very good speedup from the parallelization for up to ∼24 cores and a quasi-linear speedup for up to ∼8 cores. The enhanced CD-HIT can handle very large datasets in much less time than previous versions. AVAILABILITY: http://cd-hit.org. CONTACT: liwz@sdsc.edu. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
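
To make the term "quasi-linear speedup" concrete, the following is a minimal sketch of how speedup and parallel efficiency are conventionally computed from wall-clock times. The function names and the timing values are illustrative assumptions; the timings are hypothetical and are not measurements from the paper.

```python
def speedup(t_serial: float, t_parallel: float) -> float:
    """Speedup S(p) = T(1) / T(p)."""
    if t_serial <= 0 or t_parallel <= 0:
        raise ValueError("timings must be positive")
    return t_serial / t_parallel


def efficiency(t_serial: float, t_parallel: float, cores: int) -> float:
    """Parallel efficiency E(p) = S(p) / p; 1.0 means perfectly linear scaling."""
    if cores < 1:
        raise ValueError("cores must be >= 1")
    return speedup(t_serial, t_parallel) / cores


# Hypothetical wall-clock times in seconds (NOT taken from the paper),
# keyed by the number of cores used for the run.
example_timings = {1: 800.0, 2: 400.0, 4: 200.0, 8: 125.0}

for cores, seconds in example_timings.items():
    s = speedup(example_timings[1], seconds)
    e = efficiency(example_timings[1], seconds, cores)
    print(f"{cores:>2} cores: speedup {s:.2f}x, efficiency {e:.2f}")
```

Under this convention, a speedup is quasi-linear when the efficiency stays close to 1.0 as cores are added, and it flattens once the efficiency drops well below that.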
