Diagnostic Accuracy of a Large Language Model (ChatGPT-4) for Patients Admitted to a Community Hospital Medical Intensive Care Unit: A Retrospective Case Study

Jassimran Singh; Rhea Bohra; Vaibhavi Mukhtiar; Warren Fernandes; Charmi Bhanushali; Rajaeaswaran Chinnamuthu; SHIHLA SHIREEN KANAMGODE; June Ellis; Eric S. Silverman

doi:10.1177/08850666251368270

Jassimran Singh (Saint Vincent Hospital), Rhea Bohra (Saint Vincent Hospital), Vaibhavi Mukhtiar (Saint Vincent Hospital), Warren Fernandes (Saint Vincent Hospital), Charmi Bhanushali (Saint Vincent Hospital), Rajaeaswaran Chinnamuthu (Saint Vincent Hospital), SHIHLA SHIREEN KANAMGODE (UMass Memorial Health Care), June Ellis (Saint Vincent Hospital), Eric S. Silverman (Saint Vincent Hospital)

Journal of Intensive Care Medicine

August 17, 2025

Abstract

Background: The future of artificial intelligence in medicine includes the use of machine learning and large language models as point-of-care tools to improve diagnostic accuracy at the time of admission to an acute care hospital. The large language model ChatGPT-4 has been shown to diagnose complex medical conditions with accuracy comparable to that of experienced clinicians; however, most published studies involved curated cases or examination-like questions and were not conducted at the point of care. To test the hypothesis that ChatGPT-4 can make an accurate medical diagnosis from real-world medical cases using a convenient cut-and-paste strategy, we performed a retrospective case study of critically ill patients admitted to a community hospital medical intensive care unit.

Methods: A redacted history and physical (H&P) was essentially cut and pasted into ChatGPT-4 with uniform instructions to provide a leading diagnosis and a list of 5 possibilities as a differential diagnosis. All features that could be used to identify patients were removed to ensure privacy and HIPAA compliance. The ChatGPT-4 diagnoses were compared with critical care physician diagnoses, using the diagnosis from a blinded longitudinal chart review as the ground truth.

Results: A total of 120 randomly selected cases were included in the study. The diagnostic accuracy was 88.3% for physicians and 85.0% for ChatGPT-4, with no significant difference by McNemar testing (p = 0.249). The agreement between physician diagnosis and ChatGPT-4 diagnosis was moderate (Cohen's kappa 0.57; 95% CI: 0.35-0.79).

Conclusion: These results suggest that ChatGPT-4 achieved diagnostic accuracy comparable to that of board-certified physicians in the context of critically ill patients admitted to a community medical intensive care unit. Furthermore, the agreement was only moderate, suggesting that there may be complementary ways of combining the diagnostic acumen of physicians and ChatGPT-4 to improve overall accuracy. A prospective study would be necessary to determine whether ChatGPT-4 could improve patient outcomes as a point-of-care tool at the time of admission.
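
The two statistics named in the results, McNemar's test and Cohen's kappa, are both computed from a paired 2x2 table of correct and incorrect diagnoses. The Python sketch below illustrates the arithmetic only. The abstract does not report the paired counts, so the cell values here are hypothetical, chosen only so that the marginal accuracies match the reported 88.3% and 85.0% of 120 cases. The uncorrected chi-square form of McNemar's test is also an assumption; the abstract does not say which variant the authors used.

```python
import math

# HYPOTHETICAL paired counts (not taken from the study).
# Rows: physician correct / incorrect; columns: ChatGPT-4 correct / incorrect.
both_correct = 98      # physician correct, ChatGPT-4 correct
physician_only = 8     # physician correct, ChatGPT-4 incorrect
chatgpt_only = 4       # physician incorrect, ChatGPT-4 correct
both_wrong = 10        # both incorrect


def mcnemar_uncorrected(b, c):
    """McNemar chi-square (no continuity correction) on the discordant pairs."""
    stat = (b - c) ** 2 / (b + c)
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


def cohen_kappa(a, b, c, d):
    """Cohen's kappa for a 2x2 agreement table with cells a, b, c, d."""
    n = a + b + c + d
    observed = (a + d) / n
    # Chance agreement from the row and column marginals.
    expected = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2
    return (observed - expected) / (1 - expected)


n_cases = both_correct + physician_only + chatgpt_only + both_wrong
physician_accuracy = (both_correct + physician_only) / n_cases
chatgpt_accuracy = (both_correct + chatgpt_only) / n_cases
chi2, p = mcnemar_uncorrected(physician_only, chatgpt_only)
kappa = cohen_kappa(both_correct, physician_only, chatgpt_only, both_wrong)

print(f"physician accuracy: {physician_accuracy:.3f}")
print(f"ChatGPT-4 accuracy: {chatgpt_accuracy:.3f}")
print(f"McNemar chi-square: {chi2:.3f}, p-value: {p:.3f}")
print(f"Cohen's kappa: {kappa:.3f}")
```

McNemar's test uses only the discordant pairs, so two raters can have nearly equal accuracy and still disagree on many individual cases. Kappa captures that case-level disagreement, which is why the abstract reports both statistics.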
