GOLDMARK: Governed Outcome-Linked Diagnostic Model Assessment Reference Kit.

Chad Vanderbilt; Gabriele Campanella; Siddharth Singi; Swaraj Nanda; J Chen; Ali Kamali; Amir Momeni Boroujeni; David Kim; Mohamed A. Yakoub; Jamal Benhamida; Meera Hameed; Neeraj Kumar; Gregory M. Goldgof

Chad Vanderbilt (Memorial Sloan Kettering Cancer Center), Gabriele Campanella (Icahn School of Medicine at Mount Sinai), Siddharth Singi (Memorial Sloan Kettering Cancer Center), Swaraj Nanda (Memorial Sloan Kettering Cancer Center), J Chen (Memorial Sloan Kettering Cancer Center), Ali Kamali (Memorial Sloan Kettering Cancer Center), Amir Momeni Boroujeni (Memorial Sloan Kettering Cancer Center), David Kim (Memorial Sloan Kettering Cancer Center), Mohamed A. Yakoub (Memorial Sloan Kettering Cancer Center), Jamal Benhamida (Memorial Sloan Kettering Cancer Center), Meera Hameed (Memorial Sloan Kettering Cancer Center), Neeraj Kumar (Memorial Sloan Kettering Cancer Center), Gregory M. Goldgof (Memorial Sloan Kettering Cancer Center)

PubMed

March 21, 2026

Abstract

Background: Multiple instance learning (MIL) with pathology foundation models (PFMs) has become the standard baseline for computational biomarker (CB) development. Although architectural and optimization advances have improved the predictive performance of these methods, computational pathology lacks the standardized intermediate data formats, provenance tracking, checkpointing conventions, and reproducible evaluation metrics required for clinical-grade deployment. Consequently, discipline-level standardization, including data representation, model versioning, evaluation protocols, and auditability, is essential to enable reliable, scalable, and regulatory-ready clinical translation of CBs.

Methods: GOLDMARK (www.artificialintelligencepathology.org) is a standardized benchmarking framework built on a curated TCGA cohort with clinically anchored OncoKB level 1-3 biomarker labels. GOLDMARK distributes structured intermediate outputs, including tile coordinates, per-slide feature embeddings from canonical PFMs, embedding-level quality-control metadata, trained slide-level weights, and reference code. Multiple publicly available PFMs are benchmarked under a unified attention-based MIL head using predefined patient-level splits. Models are trained on TCGA and evaluated on an independent MSKCC cohort with reciprocal testing.

Results: [...]) and showed the most stable cross-site performance. Differences between canonical encoders were modest relative to task-specific variability.

Conclusions: Computational pathology is entering a translational phase in which reproducibility, transparency, and cross-institutional robustness are prerequisites for clinical trust. GOLDMARK establishes a reference framework that separates dataset curation from model evaluation and introduces structured intermediate artifacts, quality-control metadata, and symmetric cross-dataset testing as core components of benchmarking. Such infrastructure is essential for transforming computational biomarkers from research demonstrations into reproducible, clinically trusted workflows.
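
For readers unfamiliar with the "attention-based MIL head" mentioned in the Methods, the following is a minimal sketch of attention pooling over per-tile embeddings, assuming the commonly used ungated formulation: each tile gets a score w · tanh(V h), the scores are normalized with a softmax, and the slide embedding is the weighted sum of tile embeddings. This is not the GOLDMARK reference code; the function name `attention_mil_pool`, the parameter names `V` and `w`, and the toy values are hypothetical and chosen only for illustration.

```python
import math
from typing import List, Sequence, Tuple


def attention_mil_pool(
    tiles: Sequence[Sequence[float]],
    V: Sequence[Sequence[float]],
    w: Sequence[float],
) -> Tuple[List[float], List[float]]:
    """Pool K tile embeddings (K x D) into one slide embedding (length D).

    V (L x D) and w (length L) stand in for learned attention parameters;
    here they are hand-picked, not trained (illustrative assumption).
    """
    # Unnormalized attention score per tile: w . tanh(V h)
    scores = []
    for h in tiles:
        hidden = [math.tanh(sum(v * x for v, x in zip(row, h))) for row in V]
        scores.append(sum(wl * a for wl, a in zip(w, hidden)))

    # Numerically stable softmax over the tiles of one slide
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Slide embedding: attention-weighted sum of tile embeddings
    dim = len(tiles[0])
    slide = [sum(a * h[d] for a, h in zip(weights, tiles)) for d in range(dim)]
    return slide, weights


# Toy bag of three 2-dimensional tile embeddings (made-up numbers)
toy_tiles = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_V = [[1.0, 0.0], [0.0, 1.0]]
toy_w = [1.0, -1.0]
slide_embedding, attention = attention_mil_pool(toy_tiles, toy_V, toy_w)
```

In a full pipeline the slide embedding would feed a small classifier that predicts the biomarker label, and the attention weights can be inspected to see which tiles drove the prediction. A shared head of this kind is what allows different PFM encoders to be compared on equal footing, since only the tile embeddings change between runs.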
