Clustering of highly homologous sequences to reduce the size of large protein databases

Weizhong Li (San Diego Supercomputer Center), Lukasz Jaroszewski, Adam Godzik
Bioinformatics
March 1, 2001
Cited by 1,072

Abstract

Summary: We present a fast and flexible program for clustering large protein databases at different sequence identity levels. It takes less than 2 h for the all-against-all sequence comparison and clustering of the non-redundant protein database of over 560,000 sequences on a high-end PC. The output database, including only the representative sequences, can be used for more efficient and sensitive database searches.

Availability: The program is available from http://bioinformatics.burnham-inst.org/cd-hi

Contact: liwz@sdsc.edu or adam@burnham-inst.org
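The abstract does not spell out the algorithm, so the following is only a minimal sketch of what clustering at a sequence identity level with one representative per cluster can look like. It assumes a greedy, longest-first strategy and uses a crude identity measure built on Python's `difflib` in place of a real sequence alignment. The names `identity` and `cluster` and the 0.9 threshold are hypothetical choices for illustration, not taken from the program itself.

```python
from difflib import SequenceMatcher


def identity(a: str, b: str) -> float:
    """Crude stand-in for alignment-based identity (an assumption):
    residues in common matching blocks divided by the shorter length."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / min(len(a), len(b))


def cluster(seqs: dict[str, str], threshold: float = 0.9) -> dict[str, list[str]]:
    """Greedy clustering sketch: process sequences from longest to shortest;
    each one joins the first representative it matches at or above the
    threshold, otherwise it becomes a new representative."""
    clusters: dict[str, list[str]] = {}
    for name in sorted(seqs, key=lambda n: len(seqs[n]), reverse=True):
        for rep in clusters:
            if identity(seqs[rep], seqs[name]) >= threshold:
                clusters[rep].append(name)
                break
        else:
            # No representative was similar enough: start a new cluster.
            clusters[name] = [name]
    return clusters


if __name__ == "__main__":
    toy = {
        "a": "MKTAYIAKQRQISFVKSHFSRQ",
        "b": "MKTAYIAKQRQISFVKSHFSR",   # near-duplicate of "a"
        "c": "GGGGGGGGGGCCCCCCCCCC",    # unrelated toy sequence
    }
    for rep, members in cluster(toy).items():
        print(rep, members)
```

The keys of the returned dictionary play the role of the representative sequences that would make up the reduced output database. This naive version compares every sequence against every representative, so it would not come close to the reported running time on a database of this size.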

