Jochen Kruppa

Overview of random forest methodology and practical guidance with emphasis on computational biology and bioinformatics

Anne‐Laure Boulesteix, Silke Janitza, Jochen Kruppa et al.|Wiley Interdisciplinary Reviews Data Mining and Knowledge Discovery|2012

Cited by 921 | Open Access

Abstract: The random forest (RF) algorithm by Leo Breiman has become a standard data analysis tool in bioinformatics. It has shown excellent performance in settings where the number of variables is much larger than the number of observations, can cope with complex interaction structures as well as highly correlated variables, and returns measures of variable importance. This paper synthesizes 10 years of RF development with emphasis on applications to bioinformatics and computational biology. Special attention is paid to practical aspects such as the selection of parameters, available RF implementations, and important pitfalls and biases of RF and its variable importance measures (VIMs). The paper surveys recent developments of the methodology relevant to bioinformatics as well as some representative examples of RF applications in this context and possible directions for future research. © 2012 Wiley Periodicals, Inc. This article is categorized under: Algorithmic Development > Hierarchies and Trees; Algorithmic Development > Statistics; Application Areas > Health Care

Systematic Evaluation of Pleiotropy Identifies 6 Further Loci Associated With Coronary Artery Disease

Tom R. Webb, Jeanette Erdmann, Kathleen Stirrups et al.|Journal of the American College of Cardiology|2017

Cited by 266 | Open Access

BACKGROUND: Genome-wide association studies have so far identified 56 loci associated with risk of coronary artery disease (CAD). Many CAD loci show pleiotropy; that is, they are also associated with other diseases or traits. OBJECTIVES: This study sought to systematically test if genetic variants identified for non-CAD diseases/traits also associate with CAD and to undertake a comprehensive analysis of the extent of pleiotropy of all CAD loci. METHODS: In discovery analyses involving 42,335 CAD cases and 78,240 control subjects we tested the association of 29,383 common (minor allele frequency >5%) single nucleotide polymorphisms available on the exome array, which included a substantial proportion of known or suspected single nucleotide polymorphisms associated with common diseases or traits as of 2011. Suggestive association signals were replicated in an additional 30,533 cases and 42,530 control subjects. To evaluate pleiotropy, we tested CAD loci for association with cardiovascular risk factors (lipid traits, blood pressure phenotypes, body mass index, diabetes, and smoking behavior), as well as with other diseases/traits through interrogation of currently available genome-wide association study catalogs. RESULTS: with a range of other diseases/traits. CONCLUSIONS: We identified 6 loci associated with CAD at genome-wide significance. Several CAD loci show substantial pleiotropy, which may help us understand the mechanisms by which these loci affect CAD risk.

Probability Machines

James D. Malley, Jochen Kruppa, Abhijit Dasgupta et al.|Methods of Information in Medicine|2011

Cited by 248

BACKGROUND: Most machine learning approaches only provide a classification for binary responses. However, probabilities are required for risk estimation using individual patient characteristics. It has been shown recently that every statistical learning machine known to be consistent for a nonparametric regression problem is a probability machine that is provably consistent for this estimation problem. OBJECTIVES: The aim of this paper is to show how random forests and nearest neighbors can be used for consistent estimation of individual probabilities. METHODS: Two random forest algorithms and two nearest neighbor algorithms are described in detail for estimation of individual probabilities. We discuss the consistency of random forests, nearest neighbors and other learning machines in detail. We conduct a simulation study to illustrate the validity of the methods. We exemplify the algorithms by analyzing two well-known data sets on the diagnosis of appendicitis and the diagnosis of diabetes in Pima Indians. RESULTS: Simulations demonstrate the validity of the method. With the real data application, we show the accuracy and practicality of this approach. We provide sample code from R packages in which the probability estimation is already available. This means that all calculations can be performed using existing software. CONCLUSIONS: Random forest algorithms as well as nearest neighbor approaches are valid machine learning methods for estimating individual probabilities for binary responses. Implementations are freely available in R and may be used for applications.

Consumer credit risk: Individual probability estimates using machine learning

Jochen Kruppa, Alexandra Schwarz, Gerhard Arminger et al.|Expert Systems with Applications|2013

Cited by 197

Risk estimation and risk prediction using machine-learning methods

Jochen Kruppa, Andreas Ziegler, Inke R. König|Human Genetics|2012

Cited by 180 | Open Access

After an association between genetic variants and a phenotype has been established, further study goals comprise the classification of patients according to disease risk or the estimation of disease probability. To accomplish this, different statistical methods are required, and specifically machine-learning approaches may offer advantages over classical techniques. In this paper, we describe methods for the construction and evaluation of classification and probability estimation rules. We review the use of machine-learning approaches in this context and explain some of the machine-learning algorithms in detail. Finally, we illustrate the methodology through application to a genome-wide association analysis on rheumatoid arthritis.

