GOLDMARK: Governed Outcome-Linked Diagnostic Model Assessment Reference Kit.
Abstract
Background: Multiple instance learning (MIL) with pathology foundation models (PFMs) has become the standard baseline for computational biomarker (CB) development. While architectural and optimization advances have improved the predictive performance of these methods, computational pathology still lacks the standardized intermediate data formats, provenance tracking, checkpointing conventions, and reproducible evaluation metrics required for clinical-grade deployment. Consequently, discipline-level standardization, including data representation, model versioning, evaluation protocols, and auditability, is essential to enable reliable, scalable, and regulatory-ready clinical translation of CBs.
Methods: We present GOLDMARK (www.artificialintelligencepathology.org), a standardized benchmarking framework built on a curated TCGA cohort with clinically anchored OncoKB level 1-3 biomarker labels. GOLDMARK distributes structured intermediate outputs, including tile coordinates, per-slide feature embeddings from canonical PFMs, embedding-level quality-control metadata, trained slide-level weights, and reference code. Multiple publicly available PFMs are benchmarked under a unified attention-based MIL head using predefined patient-level splits. Models are trained on TCGA and evaluated on an independent MSKCC cohort, with reciprocal testing in the opposite direction.
Results: [...] and showed the most stable cross-site performance. Differences between canonical encoders were modest relative to task-specific variability.
Conclusions: Computational pathology is entering a translational phase in which reproducibility, transparency, and cross-institutional robustness are prerequisites for clinical trust. GOLDMARK establishes a reference framework that separates dataset curation from model evaluation and introduces structured intermediate artifacts, quality-control metadata, and symmetric cross-dataset testing as core components of benchmarking. Such infrastructure is essential for transforming computational biomarkers from research demonstrations into reproducible, clinically trusted workflows.
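To make the "attention-based MIL head" named in the Methods concrete, the following is a minimal sketch of attention pooling over per-slide tile embeddings. It is not the GOLDMARK reference code: the function names `attention_mil_pool` and `slide_probability`, the parameter names `V`, `w`, `c`, and `b`, the tanh (non-gated) attention form, and the random toy data are all assumptions made for illustration, and the abstract does not state which attention variant the benchmark uses.

```python
import numpy as np

def attention_mil_pool(H, V, w):
    """Pool tile embeddings into one slide embedding with attention weights.

    H: (n_tiles, d) tile embeddings for one slide (e.g. from a PFM encoder).
    V: (d, k) and w: (k,) are hypothetical attention parameters.
    Returns the (d,) slide embedding and the (n_tiles,) attention weights.
    """
    scores = np.tanh(H @ V) @ w      # one unnormalized score per tile
    scores = scores - scores.max()   # shift for a numerically stable softmax
    attn = np.exp(scores)
    attn = attn / attn.sum()         # weights are non-negative and sum to 1
    slide = attn @ H                 # attention-weighted average of tiles
    return slide, attn

def slide_probability(slide, c, b=0.0):
    """Hypothetical linear classifier with a sigmoid on the slide embedding."""
    return 1.0 / (1.0 + np.exp(-(slide @ c + b)))

# Toy data standing in for one slide: 50 tiles with 16-dimensional embeddings.
rng = np.random.default_rng(0)
H = rng.normal(size=(50, 16))
V = rng.normal(size=(16, 8))
w = rng.normal(size=8)
c = rng.normal(size=16)

slide, attn = attention_mil_pool(H, V, w)
p = slide_probability(slide, c)
```

In this sketch the attention weights double as a per-tile importance map, and setting `w` to zero reduces the pooling to a plain mean over tiles, which is a convenient sanity check when comparing encoders under a fixed head.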