Gilles Louppe

API design for machine learning software: experiences from the scikit-learn project

Lars Buitinck, Gilles Louppe, Mathieu Blondel et al.|arXiv (Cornell University)|2013

Cited by 1.8kOpen Access

Scikit-learn is an increasingly popular machine learning li- brary. Written in Python, it is designed to be simple and efficient, accessible to non-experts, and reusable in various contexts. In this paper, we present and discuss our design choices for the application programming interface (API) of the project. In particular, we describe the simple and elegant interface shared by all learning and processing units in the library and then discuss its advantages in terms of composition and reusability. The paper also comments on implementation details specific to the Python ecosystem and analyzes obstacles faced by users and developers of the library.

Understanding variable importances in forests of randomized trees

Gilles Louppe, Louis Wehenkel, Antonio Sutera et al.|Open Repository and Bibliography (University of Liège)|2013

Cited by 806Open Access

peer reviewed

Scikit-learn

Gaël Varoquaux, Lars Buitinck, Gilles Louppe et al.|GetMobile Mobile Computing and Communications|2015

Cited by 684

Machine learning is a pervasive development at the intersection of statistics and computer science. While it can benefit many data-related applications, the technical nature of the research literature and the corresponding algorithms slows down its adoption. Scikit-learn is an open-source software project that aims at making machine learning accessible to all, whether it be in academia or in industry. It benefits from the general-purpose Python language, which is both broadly adopted in the scientific world, and supported by a thriving ecosystem of contributors. Here we give a quick introduction to scikit-learn as well as to machine-learning basics.

Understanding Random Forests: From Theory to Practice

Gilles Louppe|arXiv (Cornell University)|2014

Cited by 596Open Access

Data analysis and machine learning have become an integrative part of the modern scientific methodology, offering automated procedures for the prediction of a phenomenon based on past observations, unraveling underlying patterns in data and providing insights about the problem. Yet, caution should avoid using machine learning as a black-box tool, but rather consider it as a methodology, with a rational thought process that is entirely dependent on the problem under study. In particular, the use of algorithms should ideally require a reasonable understanding of their mechanisms, properties and limitations, in order to better apprehend and interpret their results. Accordingly, the goal of this thesis is to provide an in-depth analysis of random forests, consistently calling into question each and every part of the algorithm, in order to shed new light on its learning capabilities, inner workings and interpretability. The first part of this work studies the induction of decision trees and the construction of ensembles of randomized trees, motivating their design and purpose whenever possible. Our contributions follow with an original complexity analysis of random forests, showing their good computational performance and scalability, along with an in-depth discussion of their implementation details, as contributed within Scikit-Learn. In the second part of this work, we analyse and discuss the interpretability of random forests in the eyes of variable importance measures. The core of our contributions rests in the theoretical characterization of the Mean Decrease of Impurity variable importance measure, from which we prove and derive some of its properties in the case of multiway totally randomized trees and in asymptotic conditions. In consequence of this work, our analysis demonstrates that variable importances [...].

Robust EEG-based cross-site and cross-protocol classification of states of consciousness

Denis A. Engemann, Federico Raimondo, Jean-Rémi King et al.|Brain|2018

Cited by 321Open Access

Determining the state of consciousness in patients with disorders of consciousness is a challenging practical and theoretical problem. Recent findings suggest that multiple markers of brain activity extracted from the EEG may index the state of consciousness in the human brain. Furthermore, machine learning has been found to optimize their capacity to discriminate different states of consciousness in clinical practice. However, it is unknown how dependable these EEG markers are in the face of signal variability because of different EEG configurations, EEG protocols and subpopulations from different centres encountered in practice. In this study we analysed 327 recordings of patients with disorders of consciousness (148 unresponsive wakefulness syndrome and 179 minimally conscious state) and 66 healthy controls obtained in two independent research centres (Paris Pitié-Salpêtrière and Liège). We first show that a non-parametric classifier based on ensembles of decision trees provides robust out-of-sample performance on unseen data with a predictive area under the curve (AUC) of ~0.77 that was only marginally affected when using alternative EEG configurations (different numbers and positions of sensors, numbers of epochs, average AUC = 0.750 ± 0.014). In a second step, we observed that classifiers based on multiple as well as single EEG features generalize to recordings obtained from different patient cohorts, EEG protocols and different centres. However, the multivariate model always performed best with a predictive AUC of 0.73 for generalization from Paris 1 to Paris 2 datasets, and an AUC of 0.78 from Paris to Liège datasets. Using simulations, we subsequently demonstrate that multivariate pattern classification has a decisive performance advantage over univariate classification as the stability of EEG features decreases, as different EEG configurations are used for feature-extraction or as noise is added. Moreover, we show that the generalization performance from Paris to Liège remains stable even if up to 20% of the diagnostic labels are randomly flipped. Finally, consistent with recent literature, analysis of the learned decision rules of our classifier suggested that markers related to dynamic fluctuations in theta and alpha frequency bands carried independent information and were most influential. Our findings demonstrate that EEG markers of consciousness can be reliably, economically and automatically identified with machine learning in various clinical and acquisition contexts.

Is this you? Claim your profile.

Top publicationsby citations