irAE-GPT: Leveraging large language models to identify immune-related adverse events in electronic health records and clinical trial datasets

Cosmin A. Bejan; Michelle Wang; Sriram Venkateswaran; Ewa A. Bergmann; Laura Hiles; Yaomin Xu; G. Scott Chandler; Sam Brondfield; Jordyn Silverstein; Francis Wright; Kimberly de Dios; Daniel Kim; Eric Mukherjee; Matthew S. Krantz; Lydia Yao; Douglas B. Johnson; Elizabeth J. Phillips; Justin M. Balko; Rajat Mohindra; Zoe Quandt

doi:10.1016/j.ebiom.2026.106227

Cosmin A. Bejan (Vanderbilt University Medical Center); Michelle Wang (University of California, San Francisco); Sriram Venkateswaran (Roche (Switzerland)); Ewa A. Bergmann (Roche (Switzerland)); Laura Hiles (Roche (United Kingdom)); Yaomin Xu (Vanderbilt University Medical Center); G. Scott Chandler (Roche (Switzerland)); Sam Brondfield (University of California, San Francisco); Jordyn Silverstein (University of California, Los Angeles); Francis Wright (University of Colorado Anschutz Medical Campus); Kimberly de Dios (University of California, San Francisco); Daniel Kim (Cedars-Sinai Medical Center); Eric Mukherjee (Vanderbilt University Medical Center); Matthew S. Krantz (Vanderbilt University Medical Center); Lydia Yao (Vanderbilt University Medical Center); Douglas B. Johnson (Vanderbilt University Medical Center); Elizabeth J. Phillips (Murdoch University); Justin M. Balko (Vanderbilt University Medical Center); Rajat Mohindra (Roche (Switzerland)); Zoe Quandt (University of California, San Francisco)

EBioMedicine

March 6, 2025

Cited by 9; Open Access

Abstract

BACKGROUND: Large language models (LLMs) have emerged as transformative technologies, revolutionising natural language understanding and generation across various domains, including medicine. In this study, we investigated the capabilities, limitations, and generalisability of Generative Pre-trained Transformer (GPT) models in analysing unstructured patient notes from large healthcare datasets to identify immune-related adverse events (irAEs) associated with the use of immune checkpoint inhibitor (ICI) therapy.

METHODS: We evaluated the performance of GPT-3.5, GPT-4, and GPT-4o models on manually annotated datasets of patients receiving ICI therapy, sampled from two electronic health record (EHR) systems and seven clinical trials. A zero-shot prompt was designed to exhaustively identify irAEs at both the patient level (main analysis) and the note level (secondary analysis). The LLM-based system followed a multi-label classification approach to identify any combination of irAEs associated with individual patients or clinical notes. System evaluation was conducted for each available irAE as well as for broader categories of irAEs classified at the organ level.

FINDINGS: Our analysis included 442 patients across three institutions. The most common irAEs manually identified in the patient datasets included pneumonitis (N = 64), colitis (N = 56), rash (N = 32), and hepatitis (N = 28). The GPT models demonstrated generalisable abilities in identifying irAEs across EHRs and clinical trial reports. Overall, the models achieved relatively high sensitivity and specificity but only moderate positive predictive values, reflecting a potential bias towards overpredicting irAE outcomes. GPT-4o achieved the highest F1 and micro-averaged F1 scores for both patient-level and note-level evaluations. The highest performance was observed in the haematological (F1 range = 1.0-1.0), gastrointestinal (F1 range = 0.81-0.85), and musculoskeletal and rheumatologic (F1 range = 0.67-1.0) irAE categories. Error analysis uncovered substantial limitations of GPT models in handling textual causation, where adverse events should not only be accurately identified in clinical text but also causally linked to immune checkpoint inhibitors.

INTERPRETATION: This study demonstrated that GPT models can automate the detection of immune-related adverse events in varied healthcare datasets, reducing the burden on physicians and other healthcare professionals by limiting the need for manual review. This capability will accelerate the generation of safety insights across large healthcare datasets and facilitate the characterisation of patient-level drivers of toxicities, thus enhancing safety monitoring and ultimately improving patient care.

FUNDING: National Institutes of Health, Roche, National Health and Medical Research Council of Australia, Stevens-Johnson Syndrome Foundation, Angela Anderson Research Fund, Larry L Hillblom Foundation and UCSF Research Allocation Program.
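The abstract reports sensitivity, specificity, positive predictive value (PPV), F1, and micro-averaged F1 for a multi-label task in which each patient or note may carry any combination of irAEs. As a minimal sketch of how such metrics are commonly computed, and not the authors' evaluation code, the Python below scores predicted label sets against gold label sets. The function names and the four toy patients are hypothetical illustrations, not study data; the toy predictions were chosen to show how overprediction keeps sensitivity high while lowering PPV.

```python
from typing import Dict, List, Set, Tuple


def confusion_counts(gold: List[Set[str]], pred: List[Set[str]],
                     label: str) -> Tuple[int, int, int, int]:
    """Return (tp, fp, fn, tn) for one label across all patients."""
    tp = fp = fn = tn = 0
    for g, p in zip(gold, pred):
        if label in p and label in g:
            tp += 1
        elif label in p:
            fp += 1
        elif label in g:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def safe_div(num: float, den: float) -> float:
    """Divide, returning 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def per_label_metrics(gold: List[Set[str]], pred: List[Set[str]],
                      labels: List[str]) -> Dict[str, Dict[str, float]]:
    """Sensitivity, specificity, PPV, and F1 for each label separately."""
    out: Dict[str, Dict[str, float]] = {}
    for label in labels:
        tp, fp, fn, tn = confusion_counts(gold, pred, label)
        out[label] = {
            "sensitivity": safe_div(tp, tp + fn),
            "specificity": safe_div(tn, tn + fp),
            "ppv": safe_div(tp, tp + fp),
            "f1": safe_div(2 * tp, 2 * tp + fp + fn),
        }
    return out


def micro_f1(gold: List[Set[str]], pred: List[Set[str]],
             labels: List[str]) -> float:
    """Micro-averaged F1: pool tp/fp/fn over all labels, then compute F1."""
    tp = fp = fn = 0
    for label in labels:
        t, f_pos, f_neg, _ = confusion_counts(gold, pred, label)
        tp += t
        fp += f_pos
        fn += f_neg
    return safe_div(2 * tp, 2 * tp + fp + fn)


# Hypothetical toy data (NOT from the study): one label set per patient.
labels = ["colitis", "pneumonitis", "rash"]
gold = [{"colitis"}, {"pneumonitis"}, set(), {"colitis", "rash"}]
pred = [{"colitis"}, {"pneumonitis", "rash"}, {"colitis"}, {"colitis"}]

metrics = per_label_metrics(gold, pred, labels)
for name in labels:
    print(name, metrics[name])
print("micro-F1:", round(micro_f1(gold, pred, labels), 3))
```

In this toy example the colitis label has sensitivity 1.0 but PPV of about 0.67, because one patient without colitis was flagged. Micro-averaging pools the counts across labels before computing F1, so frequent irAEs weigh more heavily than rare ones.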
