Henrik Marklund

CheXpert: A Large Chest Radiograph Dataset with Uncertainty Labels and Expert Comparison

Jeremy Irvin, Pranav Rajpurkar, Michael Ko et al.|Proceedings of the AAAI Conference on Artificial Intelligence|2019

Cited by 391|Open Access

Large, labeled datasets have driven deep learning methods to achieve expert-level performance on a variety of medical imaging tasks. We present CheXpert, a large dataset that contains 224,316 chest radiographs of 65,240 patients. We design a labeler to automatically detect the presence of 14 observations in radiology reports, capturing uncertainties inherent in radiograph interpretation. We investigate different approaches to using the uncertainty labels for training convolutional neural networks that output the probability of these observations given the available frontal and lateral radiographs. On a validation set of 200 chest radiographic studies which were manually annotated by 3 board-certified radiologists, we find that different uncertainty approaches are useful for different pathologies. We then evaluate our best model on a test set composed of 500 chest radiographic studies annotated by a consensus of 5 board-certified radiologists, and compare the performance of our model to that of 3 additional radiologists in the detection of 5 selected pathologies. On Cardiomegaly, Edema, and Pleural Effusion, the model ROC and PR curves lie above all 3 radiologist operating points. We release the dataset to the public as a standard benchmark to evaluate performance of chest radiograph interpretation models.

WILDS: A Benchmark of in-the-Wild Distribution Shifts

Pang Wei Koh, Shiori Sagawa, Henrik Marklund et al.|CaltechAUTHORS (California Institute of Technology)|2020

Cited by 286|Open Access

Distribution shifts -- where the training distribution differs from the test distribution -- can substantially degrade the accuracy of machine learning (ML) systems deployed in the wild. Despite their ubiquity in the real-world deployments, these distribution shifts are under-represented in the datasets widely used in the ML community today. To address this gap, we present WILDS, a curated benchmark of 10 datasets reflecting a diverse range of distribution shifts that naturally arise in real-world applications, such as shifts across hospitals for tumor identification; across camera traps for wildlife monitoring; and across time and location in satellite imaging and poverty mapping. On each dataset, we show that standard training yields substantially lower out-of-distribution than in-distribution performance. This gap remains even with models trained by existing methods for tackling distribution shifts, underscoring the need for new methods for training models that are more robust to the types of distribution shifts that arise in practice. To facilitate method development, we provide an open-source package that automates dataset loading, contains default model architectures and hyperparameters, and standardizes evaluations. Code and leaderboards are available at https://wilds.stanford.edu.

Adaptive Risk Minimization: A Meta-Learning Approach for Tackling Group Shift

Marvin Mengxin Zhang, Henrik Marklund, Nikita Dhawan et al.|arXiv (Cornell University)|2021

Cited by 39|Open Access

A fundamental assumption of most machine learning algorithms is that the training and test data are drawn from the same underlying distribution. However, this assumption is violated in almost all practical applications: machine learning systems are regularly tested under distribution shift, due to temporal correlations, particular end users, or other factors. In this work, we consider the setting where the training data are structured into groups and test time shifts correspond to changes in the group distribution. Prior work has approached this problem by attempting to be robust to all possible test time distributions, which may degrade average performance. In contrast, we propose to use ideas from meta-learning to learn models that are adaptable, such that they can adapt to shift at test time using a batch of unlabeled test points. We acquire such models by learning to adapt to training batches sampled according to different distributions, which simulate structural shifts that may occur at test time. Our primary contribution is to introduce the framework of adaptive risk minimization (ARM), a formalization of this setting that lends itself to meta-learning. We develop meta-learning methods for solving the ARM problem, and compared to a variety of prior methods, these methods provide substantial gains on image classification problems in the presence of shift.

Deep learning assistance for the histopathologic diagnosis of Helicobacter pylori

Sharon Zhou, Henrik Marklund, Ondřej Bláha et al.|Intelligence-Based Medicine|2020

Cited by 31|Open Access

Deep learning (DL), a sub-area of artificial intelligence, has demonstrated great promise at automating diagnostic tasks in pathology, yet its translation into clinical settings has been slow. Few studies have examined its impact on pathologist performance, when embedded into clinical workflows. The identification of H. pylori on H&E stain is a tedious, imprecise task which might benefit from DL assistance. In this study, a DL assistant was developed to diagnose H. pylori in gastric biopsies, and its impact on pathologist diagnostic accuracy and turnaround time was tested. H&E-stained whole-slide images (WSI) of 303 gastric biopsies with ground truth confirmation by immunohistochemistry formed the study dataset; 47 and 126 WSI were respectively used to train and optimize the DL assistant to detect H. pylori, and 130 were used in a clinical experiment in which 3 experienced GI pathologists reviewed the same test set with and without assistance. On the test set, the assistant achieved high performance, with a WSI-level area under the receiver-operating-characteristic curve (AUROC) of 0.965 (95% CI 0.934–0.987). On H. pylori-positive cases, assisted diagnoses were faster (β̂, the fixed effect size for assistance = −0.557, p = 0.003) and much more accurate (OR = 13.37, p < 0.001) than unassisted diagnoses. However, assistance increased diagnostic uncertainty on H. pylori-negative cases, resulting in an overall decrease in assisted accuracy (OR = 0.435, p = 0.016) and negligible impact on overall turnaround time (β̂ for assistance = 0.010, p = 0.860). DL can assist pathologists with H. pylori diagnosis, but its integration into clinical workflows requires optimization to mitigate diagnostic uncertainty as a potential consequence of assistance.

Adaptive Risk Minimization: Learning to Adapt to Domain Shift

Marvin Mengxin Zhang, Henrik Marklund, Nikita Dhawan et al.|arXiv (Cornell University)|2020

Cited by 28|Open Access

A fundamental assumption of most machine learning algorithms is that the training and test data are drawn from the same underlying distribution. However, this assumption is violated in almost all practical applications: machine learning systems are regularly tested under distribution shift, due to changing temporal correlations, atypical end users, or other factors. In this work, we consider the problem setting of domain generalization, where the training data are structured into domains and there may be multiple test time shifts, corresponding to new domains or domain distributions. Most prior methods aim to learn a single robust model or invariant feature space that performs well on all domains. In contrast, we aim to learn models that adapt at test time to domain shift using unlabeled test points. Our primary contribution is to introduce the framework of adaptive risk minimization (ARM), in which models are directly optimized for effective adaptation to shift by learning to adapt on the training domains. Compared to prior methods for robustness, invariance, and adaptation, ARM methods provide performance gains of 1-4% test accuracy on a number of image classification problems exhibiting domain shift.


Top publications by citations