The gene normalization task in BioCreative III

Zhiyong Lu (National Center for Biotechnology Information), Hung‐Yu Kao (National Cheng Kung University), Chih-Hsuan Wei (National Cheng Kung University), Minlie Huang (Tsinghua University), Jingchen Liu (Tsinghua University), Cheng-Ju Kuo (Institute of Information Science, Academia Sinica), Chun‐Nan Hsu (University of Southern California), Richard Tzong‐Han Tsai (Yuan Ze University), Hong-Jie Dai (Academia Sinica), Naoaki Okazaki (The University of Tokyo), Han-Cheol Cho (The University of Tokyo), Martin Gerner (University of Manchester), Illés Solt (Budapest University of Technology and Economics), Shashank Agarwal (University of Wisconsin–Milwaukee), Feifan Liu (University of Wisconsin–Milwaukee), Dina Vishnyakova (University of Geneva), Patrick Ruch, Martin Romacker (Novartis (Switzerland)), Fabio Rinaldi (University of Zurich), Sanmitra Bhattacharya (University of Iowa), Padmini Srinivasan (University of Iowa), Hongfang Liu (Mayo Clinic), Manabu Torii (Georgetown University Medical Center), Sérgio Matos (University of Aveiro), David Campos (University of Aveiro), Karin Verspoor (University of Colorado Denver), Kevin Livingston (University of Colorado Denver), W. John Wilbur (National Center for Biotechnology Information)
BMC Bioinformatics
October 3, 2011
Cited by 183 | Open Access

Abstract

BACKGROUND: We report the Gene Normalization (GN) challenge in BioCreative III where participating teams were asked to return a ranked list of identifiers of the genes detected in full-text articles. For training, 32 fully and 500 partially annotated articles were prepared. A total of 507 articles were selected as the test set. Due to the high annotation cost, it was not feasible to obtain gold-standard human annotations for all test articles. Instead, we developed an Expectation Maximization (EM) algorithm approach for choosing a small number of test articles for manual annotation that were most capable of differentiating team performance. Moreover, the same algorithm was subsequently used for inferring ground truth based solely on team submissions. We report team performance on both gold standard and inferred ground truth using a newly proposed metric called Threshold Average Precision (TAP-k).

RESULTS: We received a total of 37 runs from 14 different teams for the task. When evaluated using the gold-standard annotations of the 50 articles, the highest TAP-k scores were 0.3297 (k=5), 0.3538 (k=10), and 0.3535 (k=20), respectively. Higher TAP-k scores of 0.4916 (k=5, 10, 20) were observed when evaluated using the inferred ground truth over the full test set. When combining team results using machine learning, the best composite system achieved TAP-k scores of 0.3707 (k=5), 0.4311 (k=10), and 0.4477 (k=20) on the gold standard, representing improvements of 12.4%, 21.8%, and 26.6% over the best team results, respectively.

CONCLUSIONS: By using full text and being species non-specific, the GN task in BioCreative III has moved closer to a real literature curation task than similar tasks in the past and presents additional challenges for the text mining community, as revealed in the overall team results. By evaluating teams using the gold standard, we show that the EM algorithm allows team submissions to be differentiated while keeping the manual annotation effort feasible. Using the inferred ground truth we show measures of comparative performance between teams. Finally, by comparing team rankings on gold standard vs. inferred ground truth, we further demonstrate that the inferred ground truth is as effective as the gold standard for detecting good team performance.
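
The improvement percentages quoted in the abstract appear to be relative gains of the composite system over the best single-team TAP-k score at each k. The short check below recomputes them from the reported scores. It is only an arithmetic sketch: the variable and function names are illustrative and do not come from the paper, and it assumes the percentages are computed as (composite - best) / best.

```python
# TAP-k scores quoted in the abstract, evaluated on the gold standard.
# Keys are the values of k; all names here are illustrative only.
best_team = {5: 0.3297, 10: 0.3538, 20: 0.3535}
composite = {5: 0.3707, 10: 0.4311, 20: 0.4477}


def relative_improvement(new: float, old: float) -> float:
    """Return the relative gain of `new` over `old`, in percent."""
    return (new - old) / old * 100.0


# Assumption: the reported percentages are relative (not absolute) gains,
# rounded to one decimal place.
improvements = {
    k: round(relative_improvement(composite[k], best_team[k]), 1)
    for k in best_team
}

for k, pct in improvements.items():
    print(f"k={k}: {pct}% over the best team result")
```

Under that assumption the recomputed values agree with the reported 12.4%, 21.8%, and 26.6%.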

