Toward expert-level medical question answering with large language models

K. K. Singhal (Google (United States)), Tao Tu (Google (United States)), Juraj Gottweis (Google (United States)), Rory Sayres (Google (United States)), Ellery Wulczyn (Google (United States)), Mohamed Amin (Google (United States)), Le Hou (Google (United States)), Kevin Clark (Google (United States)), Stephen Pfohl (Google (United States)), Heather Cole-Lewis (Google (United States)), Darlene Neal (Google (United States)), Qazi Mamunur Rashid (Google (United States)), Mike Schaekermann (Google (United States)), Amy Wang (Google (United States)), Dev Dash (Stanford University), Jonathan H. Chen (Stanford Medicine), Nigam H. Shah (Stanford Health Care), Sami Lachgar (Google (United States)), P. Mansfield (Google (United States)), Sushant Prakash (Google (United States)), Bradley Green (Google (United States)), Ewa Dominowska (Google (United States)), Blaise Agüera y Arcas (Google (United States)), Nenad Tomašev (Google (United States)), Yun Liu (Google (United States)), Renee Wong (Google (United States)), Christopher Semturs (Google (United States)), S. Sara Mahdavi (Google (United States)), Joëlle Barral (Google (United States)), Dale R. Webster (Google (United States)), Greg S. Corrado (Google (United States)), Yossi Matias (Google (United States)), Shekoofeh Azizi (Google (United States)), Alan Karthikesalingam (Google (United States)), Vivek Natarajan (Google (United States))
Nature Medicine
January 8, 2025
Cited by 677 | Open Access

Abstract

Large language models (LLMs) have shown promise in medical question answering, with Med-PaLM being the first to exceed a 'passing' score in United States Medical Licensing Examination style questions. However, challenges remain in long-form medical question answering and handling real-world workflows. Here, we present Med-PaLM 2, which bridges these gaps with a combination of base LLM improvements, medical domain fine-tuning and new strategies for improving reasoning and grounding through ensemble refinement and chain of retrieval. Med-PaLM 2 scores up to 86.5% on the MedQA dataset, improving upon Med-PaLM by over 19%, and demonstrates dramatic performance increases across MedMCQA, PubMedQA and MMLU clinical topics datasets. Our detailed human evaluations framework shows that physicians prefer Med-PaLM 2 answers to those from other physicians on eight of nine clinical axes. Med-PaLM 2 also demonstrates significant improvements over its predecessor across all evaluation metrics, particularly on new adversarial datasets designed to probe LLM limitations (P < 0.001). In a pilot study using real-world medical questions, specialists preferred Med-PaLM 2 answers to generalist physician answers 65% of the time. While specialist answers were still preferred overall, both specialists and generalists rated Med-PaLM 2 to be as safe as physician answers, demonstrating its growing potential in real-world medical applications.

