Statistical Significance of Clustering for High-Dimension, Low–Sample Size Data

Yufeng Liu; D. Neil Hayes; Andrew B. Nobel; J. S. Marron

doi:10.1198/016214508000000454

Journal of the American Statistical Association

September 1, 2008

Abstract

Clustering methods provide a powerful tool for the exploratory analysis of high-dimension, low–sample size (HDLSS) data sets, such as gene expression microarray data. A fundamental statistical issue in clustering is which clusters are "really there," as opposed to being artifacts of the natural sampling variation. We propose SigClust as a simple and natural approach to this fundamental statistical problem. In particular, we define a cluster as data coming from a single Gaussian distribution and formulate the problem of assessing statistical significance of clustering as a testing procedure. This Gaussian null assumption allows direct formulation of p values that effectively quantify the significance of a given clustering. HDLSS covariance estimation for SigClust is achieved by a combination of invariance principles, together with a factor analysis model. The properties of SigClust are studied. Simulated examples, as well as an application to a real cancer microarray data set, show that the proposed method works remarkably well for assessing significance of clustering. Some theoretical results also are obtained.

KEY WORDS: Clustering; High-dimension, low–sample size data; k-means; Microarray gene expression data; p value; Statistical significance
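
To make the testing idea in the abstract concrete, the following is a minimal Monte Carlo sketch of a Gaussian-null significance test for a two-cluster split. It is not the authors' implementation. Several pieces are assumptions introduced here for illustration: the test statistic (a 2-means "cluster index", the within-cluster sum of squares divided by the total sum of squares), the function names `two_means_cluster_index` and `sigclust_sketch`, and the use of raw sample covariance eigenvalues for the null distribution. The abstract states that SigClust estimates the HDLSS covariance through invariance principles and a factor analysis model; that step is not reproduced, so this sketch is only reasonable when the sample size comfortably exceeds the dimension.

```python
import numpy as np


def two_means_cluster_index(X, rng, n_init=5, n_iter=50):
    """Approximate 2-means cluster index: within-cluster SS / total SS.

    Smaller values indicate a stronger two-cluster split.
    Uses a plain Lloyd iteration with a few random restarts.
    """
    n = X.shape[0]
    total_ss = ((X - X.mean(axis=0)) ** 2).sum()
    best = np.inf
    for _ in range(n_init):
        # Start from two distinct data points chosen at random.
        centers = X[rng.choice(n, size=2, replace=False)]
        for _ in range(n_iter):
            # Squared distance from every point to both centers, shape (n, 2).
            d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = d.argmin(axis=1)
            if labels.min() == labels.max():
                break  # one cluster emptied; give up on this start
            new_centers = np.array([X[labels == k].mean(axis=0) for k in (0, 1)])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        best = min(best, d.min(axis=1).sum())
    return best / total_ss


def sigclust_sketch(X, n_sim=200, seed=0):
    """Return (observed cluster index, Monte Carlo p value) under a Gaussian null.

    Assumption: the null covariance spectrum is taken directly from the
    sample covariance, not from the factor-analysis estimate in the paper.
    Requires at least two columns in X.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    observed = two_means_cluster_index(X, rng)
    # Only eigenvalues are needed: the statistic is rotation invariant,
    # so the null can be simulated with a diagonal covariance.
    eigvals = np.clip(np.linalg.eigvalsh(np.cov(X, rowvar=False)), 0.0, None)
    null = np.empty(n_sim)
    for b in range(n_sim):
        Z = rng.standard_normal((n, d)) * np.sqrt(eigvals)
        null[b] = two_means_cluster_index(Z, rng)
    # Add-one correction keeps the Monte Carlo p value strictly positive.
    p_value = (1 + np.sum(null <= observed)) / (n_sim + 1)
    return observed, p_value
```

Calling `sigclust_sketch(X)` on an `n × d` array returns the observed index and an empirical p value; a small p value suggests the split is stronger than what a single Gaussian with the same covariance spectrum would typically produce.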
