Statistical Significance of Clustering for High-Dimension, Low–Sample Size Data

Yufeng Liu (Segeberger Kliniken), D. Neil Hayes (Segeberger Kliniken), Andrew B. Nobel (Nobel Foundation), J. S. Marron (Nobel Foundation)
Journal of the American Statistical Association
September 1, 2008
Cited by 281

Abstract

Clustering methods provide a powerful tool for the exploratory analysis of high-dimension, low–sample size (HDLSS) data sets, such as gene expression microarray data. A fundamental statistical issue in clustering is which clusters are "really there," as opposed to being artifacts of the natural sampling variation. We propose SigClust as a simple and natural approach to this fundamental statistical problem. In particular, we define a cluster as data coming from a single Gaussian distribution and formulate the problem of assessing statistical significance of clustering as a testing procedure. This Gaussian null assumption allows direct formulation of p values that effectively quantify the significance of a given clustering. HDLSS covariance estimation for SigClust is achieved by a combination of invariance principles, together with a factor analysis model. The properties of SigClust are studied. Simulated examples, as well as an application to a real cancer microarray data set, show that the proposed method works remarkably well for assessing significance of clustering. Some theoretical results also are obtained.

KEY WORDS: Clustering; High-dimension, low–sample size data; k-means; Microarray gene expression data; p value; Statistical significance
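The testing procedure outlined in the abstract can be illustrated with a rough Python sketch. It is not the authors' implementation, and several choices are assumptions made for illustration: the test statistic is taken to be the 2-means cluster index (within-cluster sum of squares divided by total sum of squares), the factor analysis model is approximated by flooring the sample covariance eigenvalues at a MAD-based background noise variance, and the p value is a Monte Carlo proportion with an add-one correction. The function names `two_means_cluster_index` and `sigclust_sketch` are hypothetical.

```python
import numpy as np


def two_means_cluster_index(X, rng, n_init=5, n_iter=50):
    """Smallest within-cluster / total sum-of-squares ratio over 2-means restarts."""
    total_ss = ((X - X.mean(axis=0)) ** 2).sum()
    best = 1.0  # a single cluster corresponds to an index of 1
    for _ in range(n_init):
        # Start from two distinct observations chosen at random.
        centers = X[rng.choice(len(X), size=2, replace=False)]
        for _ in range(n_iter):
            dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = dist.argmin(axis=1)
            if labels.min() == labels.max():
                break  # one cluster emptied; abandon this restart
            new_centers = np.stack([X[labels == k].mean(axis=0) for k in (0, 1)])
            if np.allclose(new_centers, centers):
                break  # converged
            centers = new_centers
        if labels.min() == labels.max():
            continue
        within_ss = sum(
            ((X[labels == k] - X[labels == k].mean(axis=0)) ** 2).sum()
            for k in (0, 1)
        )
        best = min(best, within_ss / total_ss)
    return best


def sigclust_sketch(X, n_sim=200, seed=0):
    """Monte Carlo p value for 'the data come from one Gaussian' (illustrative only)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    observed = two_means_cluster_index(X, rng)

    # Invariance: the cluster index is unchanged by shifts and rotations, so the
    # null Gaussian can be taken as mean zero with a diagonal covariance whose
    # entries are the covariance eigenvalues.
    Xc = X - X.mean(axis=0)
    eigvals = np.linalg.eigvalsh(np.cov(Xc, rowvar=False))

    # Assumed stand-in for the factor model: a few large eigenvalues plus a
    # background noise level. MAD / 0.6745 estimates a standard deviation under
    # normality; eigenvalues below the implied noise variance are raised to it.
    mad = np.median(np.abs(Xc - np.median(Xc)))
    noise_var = (mad / 0.6745) ** 2
    null_eigvals = np.maximum(eigvals, noise_var)

    # Simulate the cluster index under the Gaussian null.
    null_indices = np.empty(n_sim)
    for b in range(n_sim):
        sim = rng.standard_normal((n, d)) * np.sqrt(null_eigvals)
        null_indices[b] = two_means_cluster_index(sim, rng)

    # Small observed index = strong clustering; add-one correction avoids p = 0.
    return (1 + np.sum(null_indices <= observed)) / (1 + n_sim)
```

Calling `sigclust_sketch(X)` on an n-by-d array returns a value in (0, 1]; small values suggest that the 2-means split is stronger than what a single Gaussian with comparable eigenvalues would typically produce.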

