Stress testing reveals gaps in clinic readiness of image-based diagnostic artificial intelligence models

Albert T. Young; Kristen Fernandez; Jacob Pfau; Rasika Reddy; Nhat Anh Cao; Max von Franque; Arjun Johal; Benjamin Wu; Rachel R. Wu; Jennifer Y. Chen; Raj P. Fadadu; Juan A. Vasquez; Andrew Tam; Michael J. Keiser; Maria L. Wei

doi:10.1038/s41746-020-00380-6

Albert T. Young (University of California, San Francisco), Kristen Fernandez (University of California, San Francisco), Jacob Pfau (University of California, San Francisco), Rasika Reddy (University of California, San Francisco), Nhat Anh Cao (San Francisco VA Health Care System), Max von Franque (San Francisco VA Health Care System), Arjun Johal (University of California, San Francisco), Benjamin Wu (San Francisco VA Health Care System), Rachel R. Wu (San Francisco VA Health Care System), Jennifer Y. Chen (San Francisco VA Health Care System), Raj P. Fadadu (University of California, San Francisco), Juan A. Vasquez (San Francisco VA Health Care System), Andrew Tam (San Francisco VA Health Care System), Michael J. Keiser (University of California, San Francisco), Maria L. Wei (University of California, San Francisco)

npj Digital Medicine

January 21, 2021

Open Access

Abstract

Artificial intelligence models match or exceed dermatologists in melanoma image classification. Less is known about their robustness against real-world variations, and clinicians may incorrectly assume that a model with an acceptable area under the receiver operating characteristic curve or related performance metric is ready for clinical use. Here, we systematically assessed the performance of dermatologist-level convolutional neural networks (CNNs) on real-world non-curated images by applying computational "stress tests". Our goal was to create a proxy environment in which to comprehensively test the generalizability of off-the-shelf CNNs developed without training or evaluation protocols specific to individual clinics. We found inconsistent predictions on images captured repeatedly in the same setting or subjected to simple transformations (e.g., rotation). Such transformations resulted in false positive or negative predictions for 6.5-22% of skin lesions across test datasets. Our findings indicate that models meeting conventionally reported metrics need further validation with computational stress tests to assess clinic readiness.
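
The kind of stress test described above can be illustrated with a minimal sketch. It assumes a binary classifier that returns a melanoma probability and checks whether the thresholded prediction changes when the same image is rotated. The function names, the 0.5 threshold, the restriction to 90-degree rotations, and `toy_model` are all hypothetical choices for illustration, not the models or protocol used in the study.

```python
import numpy as np

def rotation_flip_rate(images, predict_proba, threshold=0.5):
    """Fraction of images whose binary prediction changes under 90-degree rotations.

    `predict_proba` is an assumed callable mapping an image array to a
    probability in [0, 1]; it stands in for any trained classifier.
    """
    flipped = 0
    for img in images:
        # Collect the thresholded prediction for 0, 90, 180 and 270 degrees.
        labels = {bool(predict_proba(np.rot90(img, k)) >= threshold) for k in range(4)}
        # More than one distinct label means rotation alone changed the call.
        flipped += len(labels) > 1
    return flipped / len(images)

def toy_model(img):
    """Hypothetical stand-in classifier that is deliberately not rotation-invariant:
    its score is the mean intensity of the top-left quadrant."""
    h, w = img.shape[:2]
    return float(img[: h // 2, : w // 2].mean())
```

A flip rate above zero on a held-out set would indicate predictions that depend on image orientation rather than on the lesion itself, which an aggregate metric such as the area under the receiver operating characteristic curve does not reveal on its own.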
