Collaboration between clinicians and vision–language models in radiology report generation

Ryutaro Tanno (Google DeepMind (United Kingdom)), David G. T. Barrett (Google DeepMind (United Kingdom)), Andrew Sellergren (Google (United Kingdom)), Sumedh Ghaisas (Google DeepMind (United Kingdom)), Sumanth Dathathri (Google DeepMind (United Kingdom)), Abigail See (Google DeepMind (United Kingdom)), Johannes Welbl (Google DeepMind (United Kingdom)), Charles T. Lau (Google (United Kingdom)), Tao Tu (Google DeepMind (United Kingdom)), Shekoofeh Azizi (Google DeepMind (United Kingdom)), K. K. Singhal (Google (United Kingdom)), Mike Schaekermann (Google (United Kingdom)), R. May (Google DeepMind (United Kingdom)), Roy Lee (Google (United Kingdom)), SiWai Man (Google (United Kingdom)), S. Sara Mahdavi (Google DeepMind (United Kingdom)), Zahra Ahmed (Google DeepMind (United Kingdom)), Yossi Matias (Google (United Kingdom)), Joëlle Barral (Google DeepMind (United Kingdom)), S. M. Ali Eslami (Google DeepMind (United Kingdom)), Danielle Belgrave (GlaxoSmithKline (United Kingdom)), Yun Liu (Google (United Kingdom)), Sreenivasa Raju Kalidindi, Shravya Shetty (Google (United Kingdom)), Vivek Natarajan (Google (United Kingdom)), Pushmeet Kohli (Google DeepMind (United Kingdom)), Po-Sen Huang (Google DeepMind (United Kingdom)), Alan Karthikesalingam (Google (United Kingdom)), Sofia Ira Ktena (Google DeepMind (United Kingdom))
Nature Medicine
November 7, 2024
Cited by 126 | Open Access

Abstract

Automated radiology report generation has the potential to improve patient care and reduce the workload of radiologists. However, the path toward real-world adoption has been stymied by the challenge of evaluating the clinical quality of artificial intelligence (AI)-generated reports. We build a state-of-the-art report generation system for chest radiographs, called Flamingo-CXR, and perform an expert evaluation of AI-generated reports by engaging a panel of board-certified radiologists. We observe a wide distribution of preferences across the panel and across clinical settings: 56.1% of Flamingo-CXR intensive care reports were judged preferable or equivalent to clinician reports by half or more of the panel, rising to 77.7% for in/outpatient X-rays overall and to 94% for the subset of cases with no pertinent abnormal findings. Errors were observed in both human-written and Flamingo-CXR reports, with 24.8% of in/outpatient cases containing clinically significant errors in both report types, 22.8% in Flamingo-CXR reports only and 14.0% in human reports only. For reports that contain errors, we develop an assistive setting, a demonstration of clinician-AI collaboration for radiology report composition, indicating new possibilities for potential clinical utility.
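
The three error percentages quoted for in/outpatient cases describe disjoint groups (errors in both report types, in Flamingo-CXR reports only, in human reports only), so the overall share of cases with an error in each report type can be derived by simple addition. The following is a minimal sketch of that arithmetic; the function and key names are hypothetical, and it assumes that the remaining cases had no clinically significant error in either report, which the abstract implies but does not state.

```python
# Percentages of in/outpatient cases with clinically significant errors,
# as quoted in the abstract.
BOTH = 24.8        # errors in both the human and the Flamingo-CXR report
AI_ONLY = 22.8     # errors in the Flamingo-CXR report only
HUMAN_ONLY = 14.0  # errors in the human report only


def error_breakdown(both, ai_only, human_only):
    """Derive marginal error rates from three disjoint categories.

    Assumption (not stated in the abstract): the categories are exhaustive
    of all cases with errors, so the remainder had errors in neither report.
    """
    either = both + ai_only + human_only
    return {
        "ai_any": round(both + ai_only, 1),        # any error in the AI report
        "human_any": round(both + human_only, 1),  # any error in the human report
        "either": round(either, 1),                # error in at least one report
        "neither": round(100.0 - either, 1),       # error in neither report
    }


breakdown = error_breakdown(BOTH, AI_ONLY, HUMAN_ONLY)
for name, pct in breakdown.items():
    print(f"{name}: {pct}%")
```

Under these assumptions, 47.6% of in/outpatient cases had a clinically significant error in the Flamingo-CXR report, 38.8% had one in the human report, and 38.4% had none in either.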

