WILDS: A Benchmark of in-the-Wild Distribution Shifts

Pang Wei Koh; Shiori Sagawa; Henrik Marklund; Sang Michael Xie; Marvin Zhang; Akshay Balsubramani; Weihua Hu; Michihiro Yasunaga; Richard L. Phillips; Irena Gao; Tong Lee; Étienne David; Ian Stavness; Wei Guo; Berton Earnshaw; Imran S. Haque; Sara Beery; Jure Leskovec; Anshul Kundaje; Emma Pierson; Sergey Levine; Chelsea Finn; Percy Liang

doi:10.48550/arxiv.2012.07421

Pang Wei Koh (Stanford University), Shiori Sagawa (Stanford University), Henrik Marklund (Stanford University), Sang Michael Xie (University of California, Berkeley), Marvin Zhang (Stanford University), Akshay Balsubramani (Stanford University), Weihua Hu (Stanford University), Michihiro Yasunaga (Cornell University), Richard L. Phillips (Stanford University), Irena Gao (Stanford University), Tong Lee (Stanford University), Étienne David (University of Saskatchewan), Ian Stavness (The University of Tokyo), Wei Guo (The University of Tokyo), Berton Earnshaw (Recursion (United States)), Imran S. Haque (California Institute of Technology), Sara Beery (Stanford University), Jure Leskovec (Stanford University), Anshul Kundaje (Microsoft Research (United Kingdom)), Emma Pierson (University of California, Berkeley), Sergey Levine (Stanford University), Chelsea Finn (Stanford University), Percy Liang (Stanford University)

CaltechAUTHORS (California Institute of Technology)

December 14, 2020

Abstract

Distribution shifts, where the training distribution differs from the test distribution, can substantially degrade the accuracy of machine learning (ML) systems deployed in the wild. Despite their ubiquity in real-world deployments, these distribution shifts are under-represented in the datasets widely used in the ML community today. To address this gap, we present WILDS, a curated benchmark of 10 datasets reflecting a diverse range of distribution shifts that naturally arise in real-world applications, such as shifts across hospitals for tumor identification; across camera traps for wildlife monitoring; and across time and location in satellite imaging and poverty mapping. On each dataset, we show that standard training yields substantially lower out-of-distribution performance than in-distribution performance. This gap remains even with models trained by existing methods for tackling distribution shifts, underscoring the need for new methods for training models that are more robust to the types of distribution shifts that arise in practice. To facilitate method development, we provide an open-source package that automates dataset loading, contains default model architectures and hyperparameters, and standardizes evaluations. Code and leaderboards are available at https://wilds.stanford.edu.
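To make the in-distribution versus out-of-distribution comparison concrete, the sketch below computes accuracy separately on two evaluation splits and reports the gap between them. It is only an illustration under stated assumptions: it does not use the WILDS package or its API, the split names `id_test` and `ood_test` and the helper functions are hypothetical, and the labels and predictions are made-up toy values rather than results from the paper.

```python
from collections import defaultdict


def accuracy_by_split(y_true, y_pred, splits):
    """Return a dict mapping each split name to its accuracy."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for label, prediction, split in zip(y_true, y_pred, splits):
        total[split] += 1
        correct[split] += int(label == prediction)
    return {split: correct[split] / total[split] for split in total}


def id_ood_gap(acc, id_split="id_test", ood_split="ood_test"):
    """In-distribution accuracy minus out-of-distribution accuracy."""
    return acc[id_split] - acc[ood_split]


# Toy data (hypothetical): four examples drawn from the training
# distribution and four drawn from a shifted distribution.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 1, 0, 0, 1, 0]
splits = ["id_test"] * 4 + ["ood_test"] * 4

acc = accuracy_by_split(y_true, y_pred, splits)
gap = id_ood_gap(acc)
print(acc, gap)
```

A positive gap means the model does worse on the shifted split than on data resembling its training distribution, which is the pattern the abstract reports for standard training.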
