Natural Questions: A Benchmark for Question Answering Research

Tom Kwiatkowski; Jennimaria Palomaki; Olivia Redfield; Michael Collins; Ankur P. Parikh; Chris Alberti; Danielle Epstein; Illia Polosukhin; Jacob Devlin; Kenton Lee; Kristina Toutanova; Llion Jones; Matthew Kelcey; Ming‐Wei Chang; Andrew M. Dai; Jakob Uszkoreit; Quoc V. Le; Slav Petrov

doi:10.1162/tacl_a_00276

Affiliation (all authors): Google (United States)

Transactions of the Association for Computational Linguistics

August 2, 2019

Cited by 1,980; Open Access

Abstract

We present the Natural Questions corpus, a question answering data set. Questions consist of real, anonymized, aggregated queries issued to the Google search engine. An annotator is presented with a question along with a Wikipedia page from the top 5 search results, and annotates a long answer (typically a paragraph) and a short answer (one or more entities) if present on the page, or marks null if no long/short answer is present. The public release consists of 307,373 training examples with single annotations; 7,830 examples with 5-way annotations for development data; and a further 7,842 examples with 5-way annotations, sequestered as test data. We present experiments validating the quality of the data. We also describe analysis of 25-way annotations on 302 examples, giving insights into human variability on the annotation task. We introduce robust metrics for the purposes of evaluating question answering systems; demonstrate high human upper bounds on these metrics; and establish baseline results using competitive methods drawn from related literature.
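
The annotation scheme described in the abstract (a long answer, optional short answers, or null) can be pictured with a small data model. The following is a minimal sketch, not the released data format: the class names, the field names, and the (start, end) span representation are all hypothetical, chosen only to illustrate how a 5-way annotated example might be inspected for annotator agreement.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical representation: a span is a (start, end) offset pair into the page.
Span = Tuple[int, int]


@dataclass
class Annotation:
    """One annotator's judgment for a (question, Wikipedia page) pair."""
    long_answer: Optional[Span] = None      # None means "null": no long answer on the page
    short_answers: Tuple[Span, ...] = ()    # empty means no short answer was marked

    @property
    def has_long(self) -> bool:
        return self.long_answer is not None

    @property
    def has_short(self) -> bool:
        # In this sketch a short answer only counts when a long answer is also marked.
        return self.has_long and len(self.short_answers) > 0


@dataclass
class Example:
    question: str
    annotations: List[Annotation]           # 1 for training, 5 for dev/test per the abstract


def count_non_null(example: Example) -> Tuple[int, int]:
    """Return how many annotators marked a long answer and how many marked a short answer."""
    n_long = sum(a.has_long for a in example.annotations)
    n_short = sum(a.has_short for a in example.annotations)
    return n_long, n_short


# Made-up example with 5-way annotations (question text and offsets are illustrative only).
demo = Example(
    question="who wrote the declaration of independence",
    annotations=[
        Annotation(long_answer=(120, 310), short_answers=((140, 142),)),
        Annotation(long_answer=(120, 310), short_answers=((140, 142),)),
        Annotation(long_answer=(120, 310)),   # long answer only
        Annotation(),                          # null
        Annotation(),                          # null
    ],
)
n_long, n_short = count_non_null(demo)
```

Counting non-null annotations per example is one simple way to see the human variability the abstract mentions; how such counts feed into the evaluation metrics is defined in the full paper and is not reproduced here.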
