Automated real-world data integration improves cancer outcome prediction

Justin Jee(Memorial Sloan Kettering Cancer Center), Christopher J. Fong(Memorial Sloan Kettering Cancer Center), Karl Pichotta(Memorial Sloan Kettering Cancer Center), Thinh Ngoc Tran(Memorial Sloan Kettering Cancer Center), Anisha Luthra(Memorial Sloan Kettering Cancer Center), Michele Waters(Memorial Sloan Kettering Cancer Center), Chenlian Fu(Memorial Sloan Kettering Cancer Center), Mirella L. Altoé(Memorial Sloan Kettering Cancer Center), Siyang Liu(Memorial Sloan Kettering Cancer Center), Steven B. Maron(Memorial Sloan Kettering Cancer Center), Mehnaj Ahmed(Memorial Sloan Kettering Cancer Center), Susie Kim(Memorial Sloan Kettering Cancer Center), Mono Pirun(Memorial Sloan Kettering Cancer Center), Walid K. Chatila(Memorial Sloan Kettering Cancer Center), Ino de Bruijn(Memorial Sloan Kettering Cancer Center), Arfath Pasha(Memorial Sloan Kettering Cancer Center), Ritika Kundra(Memorial Sloan Kettering Cancer Center), Benjamin Groß(Memorial Sloan Kettering Cancer Center), Brooke Mastrogiacomo(Memorial Sloan Kettering Cancer Center), Tyler Aprati(Dana-Farber Cancer Institute), David Liu(Dana-Farber Cancer Institute), Jianjiong Gao(Caris Life Sciences (United States)), Marzia Capelletti(Caris Life Sciences (United States)), Kelly R. Pekala(Memorial Sloan Kettering Cancer Center), Lisa Loudon(Memorial Sloan Kettering Cancer Center), Maria Perry(Memorial Sloan Kettering Cancer Center), Chaitanya Bandlamudi(Memorial Sloan Kettering Cancer Center), Mark T.A. Donoghue(Memorial Sloan Kettering Cancer Center), Baby A. Satravada(Memorial Sloan Kettering Cancer Center), Axel Martin(Memorial Sloan Kettering Cancer Center), Ronglai Shen(Memorial Sloan Kettering Cancer Center), Yuan Chen(Memorial Sloan Kettering Cancer Center), A. Rose Brannon(Memorial Sloan Kettering Cancer Center), Jason C. Chang(Memorial Sloan Kettering Cancer Center), Lior Z. Braunstein(Memorial Sloan Kettering Cancer Center), Anyi Li(Memorial Sloan Kettering Cancer Center), Anton Safonov(Memorial Sloan Kettering Cancer Center), Aaron J. Stonestrom(Memorial Sloan Kettering Cancer Center), Pablo Sánchez Vela(Memorial Sloan Kettering Cancer Center), Clare Wilhelm(Memorial Sloan Kettering Cancer Center), Mark E. Robson(Memorial Sloan Kettering Cancer Center), Howard I. Scher(Memorial Sloan Kettering Cancer Center), Marc Ladanyi(Memorial Sloan Kettering Cancer Center), Jorge S. Reis‐Filho(Memorial Sloan Kettering Cancer Center), David B. Solit(Memorial Sloan Kettering Cancer Center), David R. Jones(Memorial Sloan Kettering Cancer Center), Daniel R. Gomez(Memorial Sloan Kettering Cancer Center), Helena A. Yu(Memorial Sloan Kettering Cancer Center), Debyani Chakravarty(Memorial Sloan Kettering Cancer Center), Rona Yaeger(Memorial Sloan Kettering Cancer Center), Wassim Abida(Memorial Sloan Kettering Cancer Center), Wungki Park(Memorial Sloan Kettering Cancer Center), Eileen M. O’Reilly(Memorial Sloan Kettering Cancer Center), Julio García‐Aguilar(Memorial Sloan Kettering Cancer Center), Nicholas D. Socci(Memorial Sloan Kettering Cancer Center), Francisco Sánchez-Vega(Memorial Sloan Kettering Cancer Center), Jian Carrot‐Zhang(Memorial Sloan Kettering Cancer Center), Peter D. Stetson(Memorial Sloan Kettering Cancer Center), Ross L. Levine(Memorial Sloan Kettering Cancer Center), Charles M. Rudin(Memorial Sloan Kettering Cancer Center), Michael F. Berger(Memorial Sloan Kettering Cancer Center), Sohrab P. Shah(Memorial Sloan Kettering Cancer Center), Deborah Schrag(Memorial Sloan Kettering Cancer Center), Pedram Razavi(Memorial Sloan Kettering Cancer Center), Kenneth L. Kehl(Dana-Farber Cancer Institute), Bob T. Li(Memorial Sloan Kettering Cancer Center), Gregory J. Riely(Memorial Sloan Kettering Cancer Center), Nikolaus Schultz(Memorial Sloan Kettering Cancer Center), MSK Cancer Data Science Initiative Group(Memorial Sloan Kettering Cancer Center), Aaron Lisman(Memorial Sloan Kettering Cancer Center), Gaofei Zhao(Memorial Sloan Kettering Cancer Center), Ino de Bruijn(Memorial Sloan Kettering Cancer Center), Walid K. Chatila(Memorial Sloan Kettering Cancer Center), Xiang Li(Memorial Sloan Kettering Cancer Center), Aarman Kohli(Memorial Sloan Kettering Cancer Center), Darin Moore(Memorial Sloan Kettering Cancer Center), Raymond Lim(Memorial Sloan Kettering Cancer Center), Tom Pollard(Memorial Sloan Kettering Cancer Center), Robert E. Sheridan(Memorial Sloan Kettering Cancer Center), Avery Wang(Memorial Sloan Kettering Cancer Center), Calla Chennault(Memorial Sloan Kettering Cancer Center), Manda Wilson(Memorial Sloan Kettering Cancer Center), Hongxin Zhang(Memorial Sloan Kettering Cancer Center), Robert Pimienta(Memorial Sloan Kettering Cancer Center), Surya Narayana Rangavajhala(Memorial Sloan Kettering Cancer Center), Guru Subramanian(Memorial Sloan Kettering Cancer Center), J.A. Valverde García(Memorial Sloan Kettering Cancer Center), Naveen Rachuri(Memorial Sloan Kettering Cancer Center), Kevin Boehm(Memorial Sloan Kettering Cancer Center), Mitchell I. Parker(Memorial Sloan Kettering Cancer Center), Henry Walch(Memorial Sloan Kettering Cancer Center), Subhiksha Nandakumar(Memorial Sloan Kettering Cancer Center), Jordan Eichholz(Memorial Sloan Kettering Cancer Center), Ayush Kris(Memorial Sloan Kettering Cancer Center), Paolo Manca(Memorial Sloan Kettering Cancer Center), Xuechun Bai(Memorial Sloan Kettering Cancer Center), Tejiri Agbamu(Memorial Sloan Kettering Cancer Center), U Justin(Memorial Sloan Kettering Cancer Center), Xiao-Jun Bi(Memorial Sloan Kettering Cancer Center)
Nature
November 6, 2024
Cited by 144
Open Access

Abstract

The digitization of health records and growing availability of tumour DNA sequencing provide an opportunity to study the determinants of cancer outcomes with unprecedented richness. Patient data are often stored in unstructured text and siloed datasets. Here we combine natural language processing annotations (refs. 1, 2) with structured medication, patient-reported demographic, tumour registry and tumour genomic data from 24,950 patients at Memorial Sloan Kettering Cancer Center to generate a clinicogenomic, harmonized oncologic real-world dataset (MSK-CHORD). MSK-CHORD includes data for non-small-cell lung (n = 7,809), breast (n = 5,368), colorectal (n = 5,543), prostate (n = 3,211) and pancreatic (n = 3,109) cancers and enables discovery of clinicogenomic relationships not apparent in smaller datasets. Leveraging MSK-CHORD to train machine learning models to predict overall survival, we find that models including features derived from natural language processing, such as sites of disease, outperform those based on genomic data or stage alone as tested by cross-validation and an external, multi-institution dataset. By annotating 705,241 radiology reports, MSK-CHORD also uncovers predictors of metastasis to specific organ sites, including a relationship between SETD2 mutation and lower metastatic potential in immunotherapy-treated lung adenocarcinoma corroborated in independent datasets. We demonstrate the feasibility of automated annotation from unstructured notes and its utility in predicting patient outcomes. The resulting data are provided as a public resource for real-world oncologic research.

A study generates a clinicogenomics dataset resource, MSK-CHORD, that combines natural language processing-derived clinical annotations with patient medical data from various sources to improve models of cancer outcome.
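The comparison of survival models built on different feature sets can be illustrated with a small sketch. The abstract does not name the evaluation metric or the model class, so everything here is an assumption: the sketch uses Harrell's concordance index, a common measure of how well a risk score ranks right-censored survival times, and six made-up patients whose values are invented purely for illustration and are not drawn from MSK-CHORD. The names `concordance_index`, `stage_only` and `stage_plus_sites` are hypothetical.

```python
from itertools import combinations


def concordance_index(times, events, risks):
    """Harrell's C-index for right-censored survival data.

    times  : follow-up time per patient
    events : 1 if death was observed, 0 if the patient was censored
    risks  : model risk score (higher means shorter expected survival)
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        if times[i] == times[j]:
            continue  # simplification: pairs with tied times are skipped
        first, second = (i, j) if times[i] < times[j] else (j, i)
        if not events[first]:
            continue  # earlier time is censored, so the true order is unknown
        comparable += 1
        if risks[first] > risks[second]:
            concordant += 1.0  # higher risk died first: correctly ranked
        elif risks[first] == risks[second]:
            concordant += 0.5  # tied risk scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Synthetic toy cohort (assumption: values invented for illustration only).
times = [5, 12, 20, 30, 42, 60]   # months of follow-up
events = [1, 1, 0, 1, 0, 1]       # 1 = death observed, 0 = censored

# Hypothetical risk scores from two feature sets.
stage_only = [4, 3, 3, 4, 2, 3]   # stage alone
n_sites = [3, 2, 0, 1, 0, 0]      # count of metastatic sites, as an NLP-style feature
stage_plus_sites = [s + n for s, n in zip(stage_only, n_sites)]

c_stage = concordance_index(times, events, stage_only)
c_combined = concordance_index(times, events, stage_plus_sites)
print(f"stage only:    C = {c_stage:.3f}")
print(f"stage + sites: C = {c_combined:.3f}")
```

In this toy cohort the combined score ranks more comparable pairs correctly than stage alone, which is the kind of difference a cross-validated comparison would look for. With real data the risk scores would come from models fitted on training folds and scored on held-out folds, not from hand-set values.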

