Georgia Institute of Technology
Publishes on Advanced Clustering Algorithms Research, Complexity and Algorithms in Graphs, Bioinformatics and Genomic Networks. 19 papers and 3.4k citations.
Add your photo, update your bio, and get notified when your ranking changes.
We motivate and develop a natural bicriteria measure for assessing the quality of a clustering that avoids the drawbacks of existing measures. A simple recursive heuristic is shown to have poly-logarithmic worst-case guarantees under the new measure. The main result of the article is the analysis of a popular spectral algorithm. One variant of spectral clustering turns out to have effective worst-case guarantees; another finds a "good" clustering, if one exists.
Random projection is a simple geometric technique for reducing the dimensionality of a set of points in Euclidean space while preserving pairwise distances approximately. The technique plays a key role in several breakthrough developments in the field of algorithms. In other cases, it provides elegant alternative proofs. The book begins with an elementary description of the technique and its basic properties. Then it develops the method in the context of applications, which are divided into three groups. The first group consists of combinatorial optimization problems such as maxcut, graph coloring, minimum multicut, graph bandwidth and VLSI layout. Presented in this context is the theory of Euclidean embeddings of graphs. The next group is machine learning problems, specifically, learning intersections of halfspaces and learning large margin hypotheses. The projection method is further refined for the latter application. The last set consists of problems inspired by information retrieval, namely, nearest neighbor search, geometric clustering and efficient low-rank approximation. Motivated by the first two applications, an extension of random projection to the hypercube is developed here. Throughout the book, random projection is used as a way to understand, simplify and connect progress on these important and seemingly unrelated problems. The book is suitable for graduate students and research mathematicians interested in computational geometry.
We present a divide-and-merge methodology for clustering a set of objects that combines a top-down “divide” phase with a bottom-up “merge” phase. In contrast, previous algorithms use either top-down or bottom-up methods to construct a hierarchical clustering or produce a flat clustering using local search (e.g., k -means). For the divide phase, which produces a tree whose leaves are the elements of the set, we suggest an efficient spectral algorithm. When the data is in the form of a sparse document-term matrix, we show how to modify the algorithm so that it maintains sparsity and runs in linear space. The merge phase quickly finds the optimal partition that respects the tree for many natural objective functions, for example, k -means, min-diameter, min-sum, correlation clustering, etc. We present a thorough experimental evaluation of the methodology. We describe the implementation of a meta-search engine that uses this methodology to cluster results from web searches. We also give comparative empirical results on several real datasets.
SUMMARY: In constraint-based metabolic modelling, physical and biochemical constraints define a polyhedral convex set of feasible flux vectors. Uniform sampling of this set provides an unbiased characterization of the metabolic capabilities of a biochemical network. However, reliable uniform sampling of genome-scale biochemical networks is challenging due to their high dimensionality and inherent anisotropy. Here, we present an implementation of a new sampling algorithm, coordinate hit-and-run with rounding (CHRR). This algorithm is based on the provably efficient hit-and-run random walk and crucially uses a preprocessing step to round the anisotropic flux set. CHRR provably converges to a uniform stationary sampling distribution. We apply it to metabolic networks of increasing dimensionality. We show that it converges several times faster than a popular artificial centering hit-and-run algorithm, enabling reliable and tractable sampling of genome-scale biochemical networks. AVAILABILITY AND IMPLEMENTATION: https://github.com/opencobra/cobratoolbox . CONTACT: ronan.mt.fleming@gmail.com or vempala@cc.gatech.edu. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.