A divide-and-merge methodology for clustering

David Cheng(Massachusetts Institute of Technology), Ravi Kannan(Yale University), Santosh Vempala(Massachusetts Institute of Technology), Grant Wang(Massachusetts Institute of Technology)
ACM Transactions on Database Systems
December 1, 2006
Cited by 117

Abstract

We present a divide-and-merge methodology for clustering a set of objects that combines a top-down “divide” phase with a bottom-up “merge” phase. In contrast, previous algorithms use either top-down or bottom-up methods to construct a hierarchical clustering or produce a flat clustering using local search (e.g., k -means). For the divide phase, which produces a tree whose leaves are the elements of the set, we suggest an efficient spectral algorithm. When the data is in the form of a sparse document-term matrix, we show how to modify the algorithm so that it maintains sparsity and runs in linear space. The merge phase quickly finds the optimal partition that respects the tree for many natural objective functions, for example, k -means, min-diameter, min-sum, correlation clustering, etc. We present a thorough experimental evaluation of the methodology. We describe the implementation of a meta-search engine that uses this methodology to cluster results from web searches. We also give comparative empirical results on several real datasets.


Related Papers

No related papers found

Powered by citation graph analysis