Range of Radiologist Performance in a Population-based Screening Cohort of 1 Million Digital Mammography Examinations

Mattie Salim (Karolinska University Hospital), Karin Dembrower (Karolinska University Hospital), Martin Eklund (Karolinska University Hospital), Peter Lindholm (Karolinska University Hospital), Fredrik Strand (Karolinska University Hospital)
Radiology
July 28, 2020
Cited by 39

Abstract

Background There is great interest in developing artificial intelligence (AI)–based computer-aided detection (CAD) systems for use in screening mammography. Comparative performance benchmarks from true screening cohorts are needed.

Purpose To determine the range of human first-reader performance measures within a population-based screening cohort of 1 million screening mammograms to gauge the performance of emerging AI CAD systems.

Materials and Methods This retrospective study consisted of all screening mammograms in women aged 40–74 years in Stockholm County, Sweden, who underwent screening with full-field digital mammography between 2008 and 2015. There were 110 interpreting radiologists, of whom 24 were defined as high-volume readers (ie, those who interpreted more than 5000 annual screening mammograms). A true-positive finding was defined as the presence of a pathology-confirmed cancer within 12 months. Performance benchmarks included sensitivity and specificity, examined per quartile of radiologists’ performance. First-reader sensitivity was determined for each tumor subgroup, overall and by quartile of high-volume reader sensitivity. Screening outcomes were examined based on the first reader’s sensitivity quartile with 10 000 screening mammograms per quartile. Linear regression models were fitted to test for a linear trend across quartiles of performance.

Results A total of 418 041 women (mean age, 54 years ± 10 [standard deviation]) were included, and 1 186 045 digital mammograms were evaluated, with 972 899 assessed by high-volume readers. Overall sensitivity was 73% (95% confidence interval [CI]: 69%, 77%), and overall specificity was 96% (95% CI: 95%, 97%). The mean values per quartile of high-volume reader performance ranged from 63% to 84% for sensitivity and from 95% to 98% for specificity. The sensitivity difference was very large for basal cancers, with the least sensitive and most sensitive high-volume readers detecting 53% and 89% of cancers, respectively (P < .001).

Conclusion Benchmarks showed a wide range of performance differences between high-volume readers. Sensitivity varied by tumor characteristics.

© RSNA, 2020. Online supplemental material is available for this article.
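
As an illustration of the benchmarks named in the abstract, the following minimal Python sketch computes per-reader sensitivity and specificity and then averages them within quartiles of readers. The reader names and counts are hypothetical and are not taken from the study, and the quartile split shown (sorting readers by the measure and cutting into four near-equal groups) is an assumption made for illustration, not necessarily the authors' procedure.

```python
from statistics import mean


def sensitivity(tp, fn):
    """Fraction of cancers that the reader flagged: TP / (TP + FN)."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of cancer-free examinations the reader cleared: TN / (TN + FP)."""
    return tn / (tn + fp)


def quartile_means(values):
    """Sort values, cut them into four near-equal groups, return each group's mean.

    Assumes at least four values so that no group is empty.
    """
    ordered = sorted(values)
    n = len(ordered)
    bounds = [round(i * n / 4) for i in range(5)]
    return [mean(ordered[bounds[i]:bounds[i + 1]]) for i in range(4)]


# Hypothetical per-reader counts (tp, fn, tn, fp); NOT data from the study.
readers = {
    "reader_1": (60, 40, 9400, 600),
    "reader_2": (64, 36, 9500, 500),
    "reader_3": (68, 32, 9550, 450),
    "reader_4": (71, 29, 9600, 400),
    "reader_5": (74, 26, 9650, 350),
    "reader_6": (78, 22, 9700, 300),
    "reader_7": (82, 18, 9750, 250),
    "reader_8": (86, 14, 9800, 200),
}

sens = [sensitivity(tp, fn) for tp, fn, _, _ in readers.values()]
spec = [specificity(tn, fp) for _, _, tn, fp in readers.values()]

print("Sensitivity by quartile:", quartile_means(sens))
print("Specificity by quartile:", quartile_means(spec))
```

Under the abstract's definition, a true positive would be an examination flagged by the first reader in a woman with pathology-confirmed cancer within 12 months; the counts above simply stand in for such tallies.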

