Large language models encode clinical knowledge

Karan Singhal (Google (United States)), Shekoofeh Azizi (Google (United States)), Tao Tu (Google (United States)), S. Sara Mahdavi (Google (United States)), Jason Lee (Google (United States)), Hyung Won Chung (Google (United States)), Nathan Scales (Google (United States)), Ajay Kumar Tanwani (Google (United States)), Heather Cole-Lewis (Google (United States)), Stephen Pfohl (Google (United States)), Perry W. Payne (Google (United States)), Martin Seneviratne (Google (United States)), Paul Gamble (Google (United States)), Christopher Kelly (Google (United States)), Abubakr Babiker (Google (United States)), Nathanael Schärli (Google (United States)), Aakanksha Chowdhery (Google (United States)), P. Mansfield (Google (United States)), Dina Demner‐Fushman (United States National Library of Medicine), Blaise Agüera y Arcas (Google (United States)), Dale R. Webster (Google (United States)), Greg S. Corrado (Google (United States)), Yossi Matias (Google (United States)), Katherine Chou (Google (United States)), Juraj Gottweis (Google (United States)), Nenad Tomašev (Google DeepMind (United Kingdom)), Yun Liu (Google (United States)), Alvin Rajkomar (Google (United States)), Joëlle Barral (Google (United States)), Christopher Semturs (Google (United States)), Alan Karthikesalingam (Google (United States)), Vivek Natarajan (Google (United States))
Nature
July 12, 2023
Cited by 3,055
Open Access

Abstract

Large language models (LLMs) have demonstrated impressive capabilities, but the bar for clinical applications is high. Attempts to assess the clinical knowledge of models typically rely on automated evaluations based on limited benchmarks. Here, to address these limitations, we present MultiMedQA, a benchmark combining six existing medical question answering datasets spanning professional medicine, research and consumer queries and a new dataset of medical questions searched online, HealthSearchQA. We propose a human evaluation framework for model answers along multiple axes including factuality, comprehension, reasoning, possible harm and bias. In addition, we evaluate Pathways Language Model [1] (PaLM, a 540-billion parameter LLM) and its instruction-tuned variant, Flan-PaLM [2], on MultiMedQA. Using a combination of prompting strategies, Flan-PaLM achieves state-of-the-art accuracy on every MultiMedQA multiple-choice dataset (MedQA [3], MedMCQA [4], PubMedQA [5] and Measuring Massive Multitask Language Understanding (MMLU) clinical topics [6]), including 67.6% accuracy on MedQA (US Medical Licensing Exam-style questions), surpassing the prior state of the art by more than 17%. However, human evaluation reveals key gaps. To resolve this, we introduce instruction prompt tuning, a parameter-efficient approach for aligning LLMs to new domains using a few exemplars. The resulting model, Med-PaLM, performs encouragingly, but remains inferior to clinicians. We show that comprehension, knowledge recall and reasoning improve with model scale and instruction prompt tuning, suggesting the potential utility of LLMs in medicine. Our human evaluations reveal limitations of today’s models, reinforcing the importance of both evaluation frameworks and method development in creating safe, helpful LLMs for clinical applications.
