Clustering of highly homologous sequences to reduce the size of large protein databases

Weizhong Li; Lukasz Jaroszewski; Adam Godzik

doi:10.1093/bioinformatics/17.3.282

Weizhong Li (San Diego Supercomputer Center), Lukasz Jaroszewski, Adam Godzik

Bioinformatics

March 1, 2001

Abstract

Summary: We present a fast and flexible program for clustering large protein databases at different sequence identity levels. It takes less than 2 h for the all-against-all sequence comparison and clustering of the non-redundant protein database of over 560,000 sequences on a high-end PC. The output database, including only the representative sequences, can be used for more efficient and sensitive database searches.

Availability: The program is available from http://bioinformatics.burnham-inst.org/cd-hi

Contact: liwz@sdsc.edu or adam@burnham-inst.org
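The idea of reducing a database to representative sequences at a chosen identity level can be illustrated with a minimal sketch. It is not the program's implementation: it assumes a simple greedy scheme (process sequences from longest to shortest, and join the first representative that meets the threshold) and a toy ungapped identity measure in place of a real alignment. The function names `identity` and `greedy_cluster` and the three toy sequences are hypothetical.

```python
from typing import Dict, List


def identity(a: str, b: str) -> float:
    """Toy identity (an assumption for illustration): fraction of matching
    positions over the shorter sequence, compared ungapped from the first
    residue. A real tool would align the sequences first."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / n


def greedy_cluster(seqs: Dict[str, str], threshold: float) -> Dict[str, List[str]]:
    """Map each representative name to the names of its cluster members."""
    # Longest first, so each representative is the longest member of its cluster.
    order = sorted(seqs, key=lambda name: (-len(seqs[name]), name))
    clusters: Dict[str, List[str]] = {}
    for name in order:
        for rep in clusters:
            if identity(seqs[rep], seqs[name]) >= threshold:
                clusters[rep].append(name)  # redundant with an existing representative
                break
        else:
            clusters[name] = [name]  # no representative is close enough: start a new cluster
    return clusters


# Hypothetical toy input: p2 is a truncated copy of p1, p3 is unrelated.
toy = {
    "p1": "MKTAYIAKQRQISFVKSHFSRQ",
    "p2": "MKTAYIAKQRQISFVKSHFS",
    "p3": "GSHMLEDPVTRNQWACYFIKLL",
}
clusters = greedy_cluster(toy, threshold=0.9)
representatives = list(clusters)  # the reduced "output database"
```

In this sketch, lowering `threshold` merges more sequences into each cluster and so shrinks the list of representatives, which mirrors clustering the same database at different identity levels.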
