A Comparison of Document Clustering Techniques

Michael Steinbach (University of Minnesota), George Karypis (University of Minnesota), Vipin Kumar (University of Minnesota)
University of Minnesota Digital Conservancy (University of Minnesota)
May 23, 2000
Cited by 2,458 | Open Access

Abstract

This paper presents the results of an experimental study of some common document clustering techniques. In particular, we compare the two main approaches to document clustering: agglomerative hierarchical clustering and K-means. (For K-means, we used a "standard" K-means algorithm and a variant, "bisecting" K-means.) Hierarchical clustering is often portrayed as the higher-quality clustering approach, but it is limited by its quadratic time complexity. In contrast, K-means and its variants have a time complexity that is linear in the number of documents, but they are thought to produce inferior clusters. Sometimes K-means and agglomerative hierarchical approaches are combined so as to "get the best of both worlds." However, our results indicate that the bisecting K-means technique is better than the standard K-means approach and as good as or better than the hierarchical approaches that we tested, across a variety of cluster evaluation metrics. We propose an explanation for these results that is based on an analysis of the specifics of the clustering algorithms and the nature of document data.
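
The abstract names bisecting K-means without describing it. As a rough illustration only, the sketch below shows one common reading of the idea: start with all points in a single cluster and repeatedly split one cluster in two with K-means (K = 2) until K clusters exist. Several details here are assumptions, not taken from the abstract: the function names (`two_means`, `bisecting_kmeans`) are hypothetical, the largest cluster is always the one chosen for splitting, and plain Euclidean distance on small numeric vectors stands in for whatever document similarity measure the study used.

```python
import random


def _mean(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def _sqdist(a, b):
    """Squared Euclidean distance (an assumed stand-in for document similarity)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def two_means(points, rng, iters=20):
    """Split `points` (at least 2) into two non-empty groups with basic K-means, K=2."""
    centers = rng.sample(points, 2)
    # Arbitrary fallback split, kept only if the first pass yields an empty
    # group (for example, when both sampled centers are identical points).
    groups = [points[:1], points[1:]]
    for _ in range(iters):
        new_groups = [[], []]
        for p in points:
            nearer_first = _sqdist(p, centers[0]) <= _sqdist(p, centers[1])
            new_groups[0 if nearer_first else 1].append(p)
        if not new_groups[0] or not new_groups[1]:
            break  # keep the last valid split
        groups = new_groups
        centers = [_mean(g) for g in groups]
    return groups


def bisecting_kmeans(points, k, seed=0):
    """Return up to `k` clusters by repeatedly bisecting the largest cluster."""
    rng = random.Random(seed)
    clusters = [list(points)]
    while len(clusters) < k:
        clusters.sort(key=len)
        target = clusters.pop()  # assumed selection rule: largest cluster
        if len(target) < 2:
            clusters.append(target)  # nothing left to split
            break
        clusters.extend(two_means(target, rng))
    return clusters


# Toy data: three small groups of 2-D points (illustrative, not from the paper).
data = [(0, 0), (0, 1), (1, 0),
        (10, 10), (10, 11), (11, 10),
        (20, 0), (20, 1), (21, 0)]
result = bisecting_kmeans(data, k=3)
for cluster in result:
    print(sorted(cluster))
```

Each bisection touches only the points of the cluster being split, which is one reason this family of methods stays cheap compared with pairwise agglomerative merging. Other rules for choosing which cluster to split, or running several trial bisections and keeping the best one, are equally plausible variants.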

