DUBStepR is a scalable correlation-based feature selection method for accurately clustering single-cell data

Bobby Ranjan (Agency for Science, Technology and Research), Wenjie Sun (Agency for Science, Technology and Research), Jinyu Park (Agency for Science, Technology and Research), Kunal Mishra (Agency for Science, Technology and Research), Florian Schmidt (Agency for Science, Technology and Research), Ronald Xie (Agency for Science, Technology and Research), Fatemeh Alipour (Agency for Science, Technology and Research), Vipul Singhal (Agency for Science, Technology and Research), Ignasius Joanito (Agency for Science, Technology and Research), Mohammad Amin Honardoost (Agency for Science, Technology and Research), Jacy Mei Yun Yong (Tan Tock Seng Hospital), Ee Tzun Koh (Tan Tock Seng Hospital), Khai Pang Leong (Tan Tock Seng Hospital), Nirmala Arul Rayan (Agency for Science, Technology and Research), Michelle Gek Liang Lim (Agency for Science, Technology and Research), Shyam Prabhakar (Agency for Science, Technology and Research)
Nature Communications
October 6, 2021
Cited by 84. Open Access.

Abstract

Feature selection (marker gene selection) is widely believed to improve clustering accuracy, and is thus a key component of single-cell clustering pipelines. Existing feature selection methods perform inconsistently across datasets, occasionally even resulting in poorer clustering accuracy than without feature selection. Moreover, existing methods ignore information contained in gene-gene correlations. Here, we introduce DUBStepR (Determining the Underlying Basis using Stepwise Regression), a feature selection algorithm that leverages gene-gene correlations with a novel measure of inhomogeneity in feature space, termed the Density Index (DI). Despite selecting a relatively small number of genes, DUBStepR substantially outperformed existing single-cell feature selection methods across diverse clustering benchmarks. Additionally, DUBStepR was the only method to robustly deconvolve T and NK heterogeneity by identifying disease-associated common and rare cell types and subtypes in PBMCs from rheumatoid arthritis patients. DUBStepR is scalable to over a million cells, and can be straightforwardly applied to other data types such as single-cell ATAC-seq. We propose DUBStepR as a general-purpose feature selection solution for accurately clustering single-cell data.
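
To make the idea of correlation-based feature selection concrete, the following is a minimal illustrative sketch in Python. It is not the DUBStepR algorithm: it omits the stepwise regression and the Density Index entirely, and simply scores each gene by its strongest absolute Pearson correlation with any other gene. The function name `rank_genes_by_correlation`, the scoring rule, and the simulated data are all assumptions made for illustration.

```python
import numpy as np


def rank_genes_by_correlation(expr, top_k):
    """Rank genes by their strongest absolute correlation with another gene.

    expr  : array of shape (n_cells, n_genes); hypothetical expression matrix.
    top_k : number of genes to return.
    Note: genes with zero variance would yield NaN correlations and should be
    filtered out beforehand (not handled in this sketch).
    """
    # Gene-gene Pearson correlation matrix (genes are columns).
    corr = np.corrcoef(expr, rowvar=False)
    # Ignore each gene's trivial correlation with itself.
    np.fill_diagonal(corr, 0.0)
    # Illustrative score: the largest absolute correlation with any other gene.
    score = np.abs(corr).max(axis=1)
    # Indices of genes sorted from highest to lowest score.
    order = np.argsort(-score)
    return order[:top_k], score


# Simulated data: genes 0-2 share one expression program,
# genes 3-9 are independent noise.
rng = np.random.default_rng(0)
n_cells = 200
program = rng.normal(size=n_cells)
expr = rng.normal(size=(n_cells, 10))
expr[:, :3] += 3.0 * program[:, None]

selected, score = rank_genes_by_correlation(expr, top_k=3)
print(sorted(selected.tolist()))
```

In this toy setting the co-regulated genes receive the highest scores because they correlate strongly with one another, while isolated noise genes do not. This is the intuition behind using gene-gene correlations as a selection signal; how DUBStepR itself scores and selects genes is described in the full paper.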

