Efficient Open-Domain Question Answering

The official website for the open domain question answering challenge at NeurIPS 2020.

Sign up for notifications
Task definition, data, and evaluation
Get the data
Making a submission

Task Definition, Data, and Evaluation

This competition has three restricted tracks that take into account both each submission’s size and the accuracy of its predictions. In each restricted track, the goal is to build the most accurate question answering system that fits within that track’s size limit.

There is also a single unrestricted track where the goal is to build the most accurate question answering system, regardless of size.

Submissions to this competition will be evaluated in two phases. In the first phase, submissions will be automatically evaluated against reference answers from five annotators, and this automatic evaluation will be used to rank submissions on a leaderboard.

At the completion of the submission period, predictions from the top-ranked systems on the leaderboard will be sent to human raters to identify correct predictions that did not appear in the set of reference answers. The human-verified accuracy will be used to re-rank each track’s winners, and this re-ranking will be presented and discussed at the NeurIPS 2020 event.

More complete details of this competition’s data and evaluation are given below. To receive notification of the leaderboard’s launch, planned for July 2020, please sign up here. If you have any technical questions about the data or evaluation, please raise an issue in this repository.


This competition will be evaluated using the open-domain variant of the Natural Questions dataset, generated by extracting all short answers using ORQA’s data processing script. We will use the original Natural Questions dataset splits for training and evaluation. We will perform final evaluation of all systems using a new, previously unseen test set that is drawn from the same distribution as the development set.

You can get the processed training and development sets from the NQ-open data repository.
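As a rough illustration of working with the processed data, the sketch below parses examples assuming the NQ-open files are JSON Lines with a `question` string and an `answer` list per record; the data repository is authoritative on the exact field names and format.

```python
import json

def load_nq_open(lines):
    """Parse NQ-open-style JSON Lines into (question, answers) pairs.

    Assumes each line is a JSON object with a "question" string and an
    "answer" list of acceptable answer strings (verify against the
    NQ-open data repository).
    """
    examples = []
    for line in lines:
        record = json.loads(line)
        examples.append((record["question"], record["answer"]))
    return examples

# Tiny in-memory sample standing in for a downloaded file.
sample = [
    '{"question": "who wrote the opera carmen", "answer": ["Georges Bizet"]}',
    '{"question": "when was the eiffel tower built", "answer": ["1887", "1889"]}',
]
examples = load_nq_open(sample)
print(examples[0])  # → ('who wrote the opera carmen', ['Georges Bizet'])
```

Note that a single question may carry several acceptable answer strings, which matters for the matching logic described under Evaluating Accuracy below.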

Other Data

Aside from the task-specific data provided above, submissions may make use of any other data that is publicly available.

We know that all of the questions in our evaluation set are answerable using Wikipedia, and we expect most participants will make use of Wikipedia as a retrieval corpus, pre-training dataset, or both. Previous work has also made use of the WikiData knowledge graph, and that is encouraged.

Participants may use any other public data resources, including other question answering datasets such as TriviaQA, MS MARCO, WebQuestions, and TREC.

Evaluating Accuracy

Automatic Evaluation

Each of the questions in the evaluation set is paired with a set of reference answers. The goal is to predict an answer string that matches one of these reference answers, after both the prediction and the reference have been normalized using this answer normalization function.

You can download our evaluation code here.
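To give a sense of how this matching typically works, here is a minimal sketch in the spirit of standard SQuAD-style answer normalization (lowercasing, stripping punctuation, removing the articles a/an/the, collapsing whitespace). The official normalization function linked above is authoritative and may differ in details.

```python
import re
import string

def normalize_answer(s):
    """Lowercase, strip punctuation, drop articles, collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(prediction, references):
    """True if the normalized prediction equals any normalized reference."""
    pred = normalize_answer(prediction)
    return any(pred == normalize_answer(ref) for ref in references)

print(exact_match("The Eiffel Tower!", ["eiffel tower"]))  # → True
print(exact_match("Paris", ["eiffel tower"]))              # → False
```

Because each question has multiple reference answers, a prediction only needs to match one of them after normalization.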

If you have any questions about the automatic evaluation, please raise a GitHub issue in this repository and we will get back to you.

Human Verification of Answers

Our evaluation data contains reference answers from five annotators. In some cases, these will not be complete, since there are many different ways to phrase the same answer, and sometimes multiple different answers are acceptable, depending on context.

At the conclusion of the submission period, the top-performing systems according to our automatic evaluation will be evaluated manually.

You can find a more thorough discussion of ambiguity and incompleteness in Natural Questions in Min et al. (2020). For updates on the design and implementation of the human verification process, please sign up here.

Measuring System Sizes

System size is measured as the number of bytes required to store a Docker image that contains the complete, self-contained question answering system. This Docker image must include all code, libraries, parameters, and data required by your system. It will not be allowed to make calls to any external APIs, and it will not be allowed to download external data.
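One way to sanity-check your size locally (a sketch, not the official measurement procedure) is to export the image to an archive and count its bytes. The helper below does the counting; since no Docker image is available here, the demo runs on a stand-in file, and the `docker save` command in the comment is shown for illustration only.

```python
import os
import tempfile

def image_archive_bytes(path):
    """Return the size in bytes of a saved image archive.

    Produce the archive first with, for example:
        docker save my-qa-system:latest -o my-qa-system.tar
    (illustrative command; the organizers' own measurement is what
    counts for the leaderboard).
    """
    return os.path.getsize(path)

# Demo on a stand-in file in place of a real image archive.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"\0" * 1024)  # pretend this is a 1 KiB archive
    stand_in = f.name

size = image_archive_bytes(stand_in)
print(size)  # → 1024
os.remove(stand_in)
```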

We will add tutorials for building small, self-contained Docker images to this site shortly. To be notified when this happens, please sign up here.

Evaluating on the Test Set

All participants can start developing their systems now, with the training and development data provided above.

The final competition rankings will be judged according to performance on a new, unseen test set drawn from the same distribution as the development data. All submissions from the restricted tracks will be run on this test set in a sandboxed environment, and their performance will be reported on a public leaderboard. To receive notification of the leaderboard launch, please sign up to our mailing list.

Participants in the unrestricted track will be provided with the test set questions after the final submissions to the restricted track leaderboards have been made. They will then have 48 hours to run their systems on the test set questions and submit answer predictions. These test set predictions will be used to determine each system’s final ranking. To be notified when the test set questions are released, please sign up to our mailing list.