irAE-GPT: Leveraging large language models to identify immune-related adverse events in electronic health records and clinical trial datasets
Abstract
BACKGROUND: Large language models (LLMs) have emerged as transformative technologies, revolutionising natural language understanding and generation across various domains, including medicine. In this study, we investigated the capabilities, limitations, and generalisability of Generative Pre-trained Transformer (GPT) models in analysing unstructured patient notes from large healthcare datasets to identify immune-related adverse events (irAEs) associated with the use of immune checkpoint inhibitor (ICI) therapy.

METHODS: We evaluated the performance of GPT-3.5, GPT-4, and GPT-4o models on manually annotated datasets of patients receiving ICI therapy, sampled from two electronic health record (EHR) systems and seven clinical trials. A zero-shot prompt was designed to exhaustively identify irAEs at both the patient level (main analysis) and the note level (secondary analysis). The LLM-based system followed a multi-label classification approach to identify any combination of irAEs associated with individual patients or clinical notes. System evaluation was conducted for each available irAE as well as for broader categories of irAEs classified at the organ level.

FINDINGS: Our analysis included 442 patients across three institutions. The most common irAEs manually identified in the patient datasets included pneumonitis (N = 64), colitis (N = 56), rash (N = 32), and hepatitis (N = 28). The GPT models demonstrated generalisable abilities in identifying irAEs across EHRs and clinical trial reports. Overall, the models achieved relatively high sensitivity and specificity but only moderate positive predictive values, reflecting a potential bias towards overpredicting irAE outcomes. GPT-4o achieved the highest F1 and micro-averaged F1 scores for both patient-level and note-level evaluations. The highest performance was observed in the haematological (F1 range = 1.0-1.0), gastrointestinal (F1 range = 0.81-0.85), and musculoskeletal and rheumatologic (F1 range = 0.67-1.0) irAE categories. Error analysis uncovered substantial limitations of GPT models in handling textual causation, where adverse events should not only be accurately identified in clinical text but also causally linked to immune checkpoint inhibitors.

INTERPRETATION: This study demonstrated that GPT models can automate the detection of immune-related adverse events in varied healthcare datasets, reducing the burden on physicians and other healthcare professionals by limiting the need for manual review. This capability will accelerate the generation of safety insights across large healthcare datasets and facilitate the characterisation of patient-level drivers of toxicities, thus enhancing safety monitoring and ultimately improving patient care.

FUNDING: National Institutes of Health, Roche, National Health and Medical Research Council of Australia, Stevens-Johnson Syndrome Foundation, Angela Anderson Research Fund, Larry L Hillblom Foundation and UCSF Research Allocation Program.
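
The evaluation measures named in the abstract (per-irAE sensitivity, specificity, positive predictive value, F1, and a micro-averaged F1 over multi-label predictions) can be illustrated with a minimal Python sketch. This is not the authors' evaluation code: the function names are hypothetical, and the six toy patients are invented for illustration and are not taken from the study data. Each patient is represented as a set of irAE labels, once from manual annotation (gold) and once from the model (pred).

```python
from collections import Counter

def confusion_counts(gold, pred, labels):
    """Count TP/FP/FN/TN per label over paired gold and predicted label sets."""
    counts = {label: Counter() for label in labels}
    for g, p in zip(gold, pred):
        for label in labels:
            if label in p and label in g:
                counts[label]["tp"] += 1
            elif label in p:
                counts[label]["fp"] += 1
            elif label in g:
                counts[label]["fn"] += 1
            else:
                counts[label]["tn"] += 1
    return counts

def safe_div(num, den):
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0

def metrics(c):
    """Sensitivity, specificity, PPV and F1 from one set of confusion counts."""
    tp, fp, fn, tn = c["tp"], c["fp"], c["fn"], c["tn"]
    return {
        "sensitivity": safe_div(tp, tp + fn),
        "specificity": safe_div(tn, tn + fp),
        "ppv": safe_div(tp, tp + fp),
        "f1": safe_div(2 * tp, 2 * tp + fp + fn),
    }

def micro_f1(counts):
    """Micro-averaged F1: pool the counts of all labels, then compute F1 once."""
    pooled = sum(counts.values(), Counter())
    return metrics(pooled)["f1"]

# Toy data (invented): one label set per patient.
labels = ["colitis", "pneumonitis", "rash"]
gold = [{"colitis"}, {"pneumonitis"}, set(), {"colitis", "rash"}, set(), set()]
pred = [{"colitis"}, {"pneumonitis", "rash"}, {"colitis"}, {"colitis"}, set(), {"rash"}]

counts = confusion_counts(gold, pred, labels)
for label in labels:
    print(label, metrics(counts[label]))
print("micro-averaged F1:", micro_f1(counts))
```

In this toy example the colitis label has perfect sensitivity but a lower positive predictive value because of one false positive, which is the same pattern the abstract describes as a tendency towards overprediction.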