BIRCH
Finding useful patterns in large datasets has attracted considerable interest recently, and one of the most widely studied problems in this area is the identification of clusters, or densely populated regions, in a multi-dimensional dataset. Prior work does not adequately address the problem of large datasets and minimization of I/O costs. This paper presents a data clustering method named BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies), and demonstrates that it is especially suitable for very large databases. BIRCH incrementally and dynamically clusters incoming multi-dimensional metric data points to try to produce the best quality clustering with the available resources (i.e., available memory and time constraints). BIRCH can typically find a good clustering with a single scan of the data, and improve the quality further with a few additional scans. BIRCH is also the first clustering algorithm proposed in the database area to handle "noise" (data points that are not part of the underlying pattern) effectively. We evaluate BIRCH's time/space efficiency, data input order sensitivity, and clustering quality through several experiments. We also present a performance comparison of BIRCH versus CLARANS, a clustering method proposed recently for large datasets, and show that BIRCH is consistently superior.
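The single-scan, incremental behaviour described in the BIRCH abstract can be illustrated with a much simplified sketch. It assumes the cluster summary used in the BIRCH paper (point count, linear sum, sum of squared norms), which the abstract itself does not spell out, and it keeps a flat list of summaries rather than the paper's balanced tree. The names `ClusterFeature`, `cluster_stream`, and `threshold` are illustrative, not taken from the paper.

```python
import math
from dataclasses import dataclass


@dataclass
class ClusterFeature:
    """Compact summary of a cluster: count, linear sum, sum of squared norms."""
    n: int
    linear_sum: list
    square_sum: float

    @classmethod
    def from_point(cls, point):
        return cls(1, list(point), sum(x * x for x in point))

    def merged_with(self, point):
        """Return the summary this cluster would have after absorbing `point`."""
        return ClusterFeature(
            self.n + 1,
            [s + x for s, x in zip(self.linear_sum, point)],
            self.square_sum + sum(x * x for x in point),
        )

    def centroid(self):
        return [s / self.n for s in self.linear_sum]

    def radius(self):
        """Root-mean-square distance of the summarized points from the centroid."""
        c = self.centroid()
        variance = self.square_sum / self.n - sum(x * x for x in c)
        return math.sqrt(max(variance, 0.0))  # guard against rounding below zero


def cluster_stream(points, threshold):
    """One pass over the data: each point joins the nearest summary if the
    resulting radius stays within `threshold` (an assumed tuning parameter);
    otherwise it starts a new cluster. Raw points are never stored."""
    clusters = []
    for p in points:
        if clusters:
            best = min(
                range(len(clusters)),
                key=lambda i: math.dist(clusters[i].centroid(), p),
            )
            candidate = clusters[best].merged_with(p)
            if candidate.radius() <= threshold:
                clusters[best] = candidate
                continue
        clusters.append(ClusterFeature.from_point(p))
    return clusters


points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (0.0, 0.1), (5.1, 4.9)]
clusters = cluster_stream(points, threshold=0.5)
```

Because only the summaries are kept, memory use depends on the number of clusters rather than the number of points, which is the property that makes a single scan over a very large dataset feasible in this sketch.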
Condor - a hunter of idle workstations
The design, implementation, and performance of the Condor scheduling system, which operates in a workstation environment, are presented. The system aims to maximize the utilization of workstations with as little interference as possible between the jobs it schedules and the activities of the people who own workstations. It identifies idle workstations and schedules background jobs on them. When the owner of a workstation resumes activity at a station, Condor checkpoints the remote job running on the station and transfers it to another workstation. The system guarantees that the job will eventually complete, and that very little, if any, work will be performed more than once. A performance profile of the system is presented that is based on data accumulated from 23 stations during one month.
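The scheduling policy described in the Condor abstract (run background jobs on idle workstations, checkpoint and move them when the owner returns) can be illustrated with a toy simulation. This is a sketch under strong assumptions, not Condor's implementation: time advances in discrete ticks, a checkpoint preserves all completed work, and migration is instantaneous. The names `Job`, `simulate`, and `owner_activity` are hypothetical.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    total_work: int
    done: int = 0  # checkpointed progress, preserved across migrations


def simulate(jobs, owner_activity):
    """Toy model of the policy.

    owner_activity[t][w] is True when the owner of workstation w is active
    at tick t. Returns a dict mapping job name to the tick it completed.
    """
    queue = deque(jobs)
    n_stations = len(owner_activity[0])
    running = [None] * n_stations
    finished = {}
    for tick, active in enumerate(owner_activity):
        # Owner resumed activity: checkpoint the job and requeue it.
        for w in range(n_stations):
            if active[w] and running[w] is not None:
                queue.appendleft(running[w])
                running[w] = None
        # Place queued jobs on idle workstations.
        for w in range(n_stations):
            if not active[w] and running[w] is None and queue:
                running[w] = queue.popleft()
        # Every running job makes one unit of progress this tick.
        for w in range(n_stations):
            job = running[w]
            if job is not None:
                job.done += 1
                if job.done == job.total_work:
                    finished[job.name] = tick
                    running[w] = None
    return finished


jobs = [Job("a", total_work=3)]
# Workstation 0 is idle only at tick 0; workstation 1 becomes idle from tick 1.
owner_activity = [
    [False, True],
    [True, False],
    [True, False],
]
finished = simulate(jobs, owner_activity)
```

In this run the job starts on workstation 0, is checkpointed when its owner returns, and resumes on workstation 1 from the saved progress, so no unit of work is repeated.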
Distributed computing in practice: the Condor experience
Douglas Thain, Todd Tannenbaum, Miron Livny | Concurrency and Computation: Practice and Experience | 2005
Since 1984, the Condor project has enabled ordinary users to do extraordinary computing. Today, the project continues to explore the social and technical problems of cooperative computing on scales ranging from the desktop to the world-wide computational Grid. In this paper, we provide the history and philosophy of the Condor project and describe how it has interacted with other projects and evolved along with the field of distributed computing. We outline the core components of the Condor system and describe how the technology of computing must correspond to social structures. Throughout, we reflect on the lessons of experience and chart the course travelled by research ideas as they grow into production systems. Copyright © 2005 John Wiley & Sons, Ltd.
BioMagResBank
The BioMagResBank (BMRB: www.bmrb.wisc.edu) is a repository for experimental and derived data gathered from nuclear magnetic resonance (NMR) spectroscopic studies of biological molecules. BMRB is a partner in the Worldwide Protein Data Bank (wwPDB). The BMRB archive consists of four main data depositories: (i) quantitative NMR spectral parameters for proteins, peptides, nucleic acids, carbohydrates and ligands or cofactors (assigned chemical shifts, coupling constants and peak lists) and derived data (relaxation parameters, residual dipolar couplings, hydrogen exchange rates, pKa values, etc.), (ii) databases for NMR restraints processed from original author depositions available from the Protein Data Bank, (iii) time-domain (raw) spectral data from NMR experiments used to assign spectral resonances and determine the structures of biological macromolecules and (iv) a database of one- and two-dimensional 1H and 13C NMR spectra for over 250 metabolites. The BMRB website provides free access to all of these data. BMRB has tools for querying the archive and retrieving information, and an ftp site (ftp.bmrb.wisc.edu) from which data in the archive can be downloaded in bulk. Two BMRB mirror sites exist: one at the PDBj, Protein Research Institute, Osaka University, Osaka, Japan (bmrb.protein.osaka-u.ac.jp) and the other at CERM, University of Florence, Florence, Italy (bmrb.postgenomicnmr.net/). The site at Osaka also accepts and processes data depositions.