S

Santosh Vempala

Georgia Institute of Technology

Publishes on Advanced Clustering Algorithms Research, Complexity and Algorithms in Graphs, Bioinformatics and Genomic Networks. 19 papers and 3.4k citations.

19Publications
3.4kTotal Citations

Is this you? Claim your profile.

Add your photo, update your bio, and get notified when your ranking changes.

Top publicationsby citations

On clusterings
Ravi Kannan, Santosh Vempala, Adrian Vetta|Journal of the ACM|2004
Cited by 858

We motivate and develop a natural bicriteria measure for assessing the quality of a clustering that avoids the drawbacks of existing measures. A simple recursive heuristic is shown to have poly-logarithmic worst-case guarantees under the new measure. The main result of the article is the analysis of a popular spectral algorithm. One variant of spectral clustering turns out to have effective worst-case guarantees; another finds a "good" clustering, if one exists.

The Random Projection Method
Santosh Vempala|DIMACS series in discrete mathematics and theoretical computer science|2005
Cited by 543

Random projection is a simple geometric technique for reducing the dimensionality of a set of points in Euclidean space while preserving pairwise distances approximately. The technique plays a key role in several breakthrough developments in the field of algorithms. In other cases, it provides elegant alternative proofs. The book begins with an elementary description of the technique and its basic properties. Then it develops the method in the context of applications, which are divided into three groups. The first group consists of combinatorial optimization problems such as maxcut, graph coloring, minimum multicut, graph bandwidth and VLSI layout. Presented in this context is the theory of Euclidean embeddings of graphs. The next group is machine learning problems, specifically, learning intersections of halfspaces and learning large margin hypotheses. The projection method is further refined for the latter application. The last set consists of problems inspired by information retrieval, namely, nearest neighbor search, geometric clustering and efficient low-rank approximation. Motivated by the first two applications, an extension of random projection to the hypercube is developed here. Throughout the book, random projection is used as a way to understand, simplify and connect progress on these important and seemingly unrelated problems. The book is suitable for graduate students and research mathematicians interested in computational geometry.

A divide-and-merge methodology for clustering
David Cheng, Ravi Kannan, Santosh Vempala et al.|ACM Transactions on Database Systems|2006
Cited by 117

We present a divide-and-merge methodology for clustering a set of objects that combines a top-down “divide” phase with a bottom-up “merge” phase. In contrast, previous algorithms use either top-down or bottom-up methods to construct a hierarchical clustering or produce a flat clustering using local search (e.g., k -means). For the divide phase, which produces a tree whose leaves are the elements of the set, we suggest an efficient spectral algorithm. When the data is in the form of a sparse document-term matrix, we show how to modify the algorithm so that it maintains sparsity and runs in linear space. The merge phase quickly finds the optimal partition that respects the tree for many natural objective functions, for example, k -means, min-diameter, min-sum, correlation clustering, etc. We present a thorough experimental evaluation of the methodology. We describe the implementation of a meta-search engine that uses this methodology to cluster results from web searches. We also give comparative empirical results on several real datasets.

CHRR: coordinate hit-and-run with rounding for uniform sampling of constraint-based models
Hulda S. Haraldsdóttir, Ben Cousins, Ines Thiele et al.|Bioinformatics|2017
Cited by 101Open Access

SUMMARY: In constraint-based metabolic modelling, physical and biochemical constraints define a polyhedral convex set of feasible flux vectors. Uniform sampling of this set provides an unbiased characterization of the metabolic capabilities of a biochemical network. However, reliable uniform sampling of genome-scale biochemical networks is challenging due to their high dimensionality and inherent anisotropy. Here, we present an implementation of a new sampling algorithm, coordinate hit-and-run with rounding (CHRR). This algorithm is based on the provably efficient hit-and-run random walk and crucially uses a preprocessing step to round the anisotropic flux set. CHRR provably converges to a uniform stationary sampling distribution. We apply it to metabolic networks of increasing dimensionality. We show that it converges several times faster than a popular artificial centering hit-and-run algorithm, enabling reliable and tractable sampling of genome-scale biochemical networks. AVAILABILITY AND IMPLEMENTATION: https://github.com/opencobra/cobratoolbox . CONTACT: ronan.mt.fleming@gmail.com or vempala@cc.gatech.edu. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.