Efficient Open-Domain Question Answering

The official website for the open domain question answering challenge at NeurIPS 2020.

Getting Started with Baselines

We provide several baseline systems to help you get started with this challenge.

Retrieval-based (TF-IDF / DPR)

We provide two retrieval-based baselines, TF-IDF and DPR.

Note that for both baselines, we use text blocks of 100 words as passages and a BERT-base multi-passage reader. See more details in the DPR paper.
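
As a rough illustration of what "text blocks of 100 words as passages" and TF-IDF retrieval mean, here is a minimal sketch. It is not the baseline's implementation: the function names, the whitespace tokenization, and the smoothed IDF formula are all assumptions chosen for brevity.

```python
import math
from collections import Counter


def split_into_passages(text, words_per_passage=100):
    """Split text into consecutive blocks of at most `words_per_passage` words."""
    words = text.split()
    return [" ".join(words[i:i + words_per_passage])
            for i in range(0, len(words), words_per_passage)]


def tfidf_rank(question, passages):
    """Return passage indices ordered by a simple TF-IDF score, best first."""
    tokenized = [p.lower().split() for p in passages]
    n = len(tokenized)
    # Document frequency: in how many passages each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    question_terms = set(question.lower().split())
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Sum tf * idf over the question terms present in this passage.
        score = sum(tf[t] * math.log(1 + n / df[t])
                    for t in question_terms if t in tf)
        scores.append(score)
    # sorted() is stable, so ties keep their original passage order.
    return sorted(range(n), key=lambda i: -scores[i])


passages = ["the cat sat on the mat",
            "paris is the capital of france",
            "the dog barked"]
print(tfidf_rank("capital of france", passages))
```

A real system would add proper tokenization and a sparse index, but the ranking idea is the same: passages that share rare terms with the question score highest.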

We provide two variants of each model: (1) full, which uses all of Wikipedia, and (2) subset, which uses only the Wikipedia articles found relevant to the questions in the training data. The subset variant is a naive way to reduce the disk usage of the retrieval-based baselines.

Model          Exact Match   Disk usage (GB)
TFIDF-full     32.0          20.1
TFIDF-subset   31.0          2.8
DPR-full       41.0          66.4
DPR-subset     34.8          5.9
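
To make the trade-off in the table concrete, the sketch below recomputes how much disk each subset variant saves and how much exact match it gives up. The numbers are copied from the table; the dictionary and function names are illustrative only.

```python
# (exact match, disk usage in GB), copied from the table above.
results = {
    "TFIDF-full": (32.0, 20.1),
    "TFIDF-subset": (31.0, 2.8),
    "DPR-full": (41.0, 66.4),
    "DPR-subset": (34.8, 5.9),
}


def subset_tradeoff(model):
    """Compare the full and subset variants of one retrieval model."""
    em_full, gb_full = results[f"{model}-full"]
    em_sub, gb_sub = results[f"{model}-subset"]
    return {
        "em_drop": round(em_full - em_sub, 1),
        "gb_saved": round(gb_full - gb_sub, 1),
        "disk_reduction_pct": round(100 * (1 - gb_sub / gb_full), 1),
    }


for model in ("TFIDF", "DPR"):
    print(model, subset_tradeoff(model))
```

By this calculation the subset variant cuts disk usage by about 86% for TF-IDF at a cost of 1.0 exact match points, and by about 91% for DPR at a cost of 6.2 points.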

Details on training and testing the models are available in the EfficientQA Retrieval-based Baselines repo.

Getting ready

git clone https://github.com/facebookresearch/DPR.git && cd DPR && pip3 install .
cd ..
git clone https://github.com/efficientqa/retrieval-based-baselines.git && cd retrieval-based-baselines && pip3 install -r requirements.txt

Let ${base_dir} be the base directory in which data and pretrained models are stored. (For example, run export base_dir="data".)

Running end-to-end prediction (shortcut)

chmod +x run.sh
./run.sh tfidf-full # for TFIDF-full
./run.sh tfidf-subset # for TFIDF-subset
./run.sh dpr-full # for DPR-full
./run.sh dpr-subset # for DPR-subset

Running end-to-end prediction (detail)

Step 1: Download data

wget https://raw.githubusercontent.com/efficientqa/nq-open/master/NQ-open.dev.jsonl
python3 ../DPR/data/download_data.py --resource data.wikipedia_split --output_dir ${base_dir} # Wikipedia DB
# (For "subset" variants) create a new DB with a subset of Wikipedia passages
python3 ../DPR/data/download_data.py --resource data.retriever.nq-train --output_dir ${base_dir}
python3 filter_subset_wiki.py --db_path ${base_dir}/data/wikipedia_split/psgs_w100.tsv --data_path ${base_dir}/data/retriever/nq-train.json
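
The idea behind the subset DB can be sketched as follows. This is not the logic of filter_subset_wiki.py itself; it is a hypothetical version that assumes the passage file is tab-separated with an "id", "text", "title" header and that each training example carries a "positive_ctxs" list whose entries have a "title" field.

```python
import csv
import io


def relevant_titles(train_examples):
    """Collect the titles of all articles marked as positive contexts."""
    titles = set()
    for example in train_examples:
        for ctx in example.get("positive_ctxs", []):
            titles.add(ctx["title"])
    return titles


def filter_passages(tsv_in, tsv_out, titles):
    """Copy only passages whose article title is in `titles`; return the count kept."""
    reader = csv.reader(tsv_in, delimiter="\t")
    writer = csv.writer(tsv_out, delimiter="\t", lineterminator="\n")
    header = next(reader)
    writer.writerow(header)
    title_col = header.index("title")
    kept = 0
    for row in reader:
        if row[title_col] in titles:
            writer.writerow(row)
            kept += 1
    return kept


# Tiny in-memory example standing in for the real files (assumed layout).
train = [{"positive_ctxs": [{"title": "Paris"}]}]
source = io.StringIO("id\ttext\ttitle\n"
                     "1\tParis is in France.\tParis\n"
                     "2\tBerlin is in Germany.\tBerlin\n")
target = io.StringIO()
print(filter_passages(source, target, relevant_titles(train)))
```

With real data, the two StringIO objects would be replaced by open file handles for the full passage file and the new subset file.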

Step 2: Download indexes and pretrained models

# For TFIDF-full
python3 ../DPR/data/download_data.py --resource indexes.tfidf.nq.full --output_dir ${base_dir} # TFIDF index
python3 ../DPR/data/download_data.py --resource checkpoint.reader.nq-tfidf.hf-bert-base --output_dir ${base_dir} # reader checkpoint
export tfidf_index="${base_dir}/indexes/tfidf/nq/full.npz"
export reader_checkpoint="${base_dir}/checkpoint/reader/nq-tfidf/hf-bert-base.cp"

# For TFIDF-subset
python3 ../DPR/data/download_data.py --resource indexes.tfidf.nq.subset --output_dir ${base_dir} # TFIDF index
python3 ../DPR/data/download_data.py --resource checkpoint.reader.nq-tfidf-subset.hf-bert-base --output_dir ${base_dir} # reader checkpoint
export tfidf_index="${base_dir}/indexes/tfidf/nq/subset.npz"
export reader_checkpoint="${base_dir}/checkpoint/reader/nq-tfidf-subset/hf-bert-base.cp"

# For DPR-full
python3 ../DPR/data/download_data.py --resource checkpoint.retriever.single.nq.bert-base-encoder --output_dir ${base_dir} # retrieval checkpoint
python3 ../DPR/data/download_data.py --resource indexes.single.nq.full --output_dir ${base_dir} # DPR index
python3 ../DPR/data/download_data.py --resource checkpoint.reader.nq-single.hf-bert-base --output_dir ${base_dir} # reader checkpoint
export dpr_retrieval_checkpoint="${base_dir}/checkpoint/retriever/single/nq/bert-base-encoder.cp"
export dpr_index="${base_dir}/indexes/single/nq/full"
export reader_checkpoint="${base_dir}/checkpoint/reader/nq-single/hf-bert-base.cp"

# For DPR-subset
python3 ../DPR/data/download_data.py --resource checkpoint.retriever.single.nq.bert-base-encoder --output_dir ${base_dir} # retrieval checkpoint
python3 ../DPR/data/download_data.py --resource indexes.single.nq.subset --output_dir ${base_dir} # DPR index
python3 ../DPR/data/download_data.py --resource checkpoint.reader.nq-single-subset.hf-bert-base --output_dir ${base_dir} # reader checkpoint
export dpr_retrieval_checkpoint="${base_dir}/checkpoint/retriever/single/nq/bert-base-encoder.cp"
export dpr_index="${base_dir}/indexes/single/nq/subset"
export reader_checkpoint="${base_dir}/checkpoint/reader/nq-single-subset/hf-bert-base.cp"
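
The four groups of exports above follow a regular pattern, which the sketch below captures in one place. The helper name variant_paths is hypothetical; the paths it returns are the ones listed above, relative to the base directory.

```python
def variant_paths(base_dir, variant):
    """Map a variant name such as "dpr-subset" to the paths exported above."""
    retriever, scope = variant.split("-")  # raises ValueError if malformed
    if retriever not in ("tfidf", "dpr") or scope not in ("full", "subset"):
        raise ValueError(f"unknown variant: {variant}")
    # Reader checkpoints for subset variants carry a "-subset" suffix.
    suffix = "" if scope == "full" else "-subset"
    if retriever == "tfidf":
        return {
            "tfidf_index": f"{base_dir}/indexes/tfidf/nq/{scope}.npz",
            "reader_checkpoint":
                f"{base_dir}/checkpoint/reader/nq-tfidf{suffix}/hf-bert-base.cp",
        }
    return {
        # The retrieval checkpoint is shared by DPR-full and DPR-subset.
        "dpr_retrieval_checkpoint":
            f"{base_dir}/checkpoint/retriever/single/nq/bert-base-encoder.cp",
        "dpr_index": f"{base_dir}/indexes/single/nq/{scope}",
        "reader_checkpoint":
            f"{base_dir}/checkpoint/reader/nq-single{suffix}/hf-bert-base.cp",
    }


for name, path in variant_paths("data", "dpr-subset").items():
    print(f"export {name}={path}")
```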

Step 3: Run inference script to save predictions from the questions

# Choose the retrieval type (tfidf or dpr) and the passage DB
# (psgs_w100.tsv for "full", psgs_w100_subset.tsv for "subset").
retrieval_type=tfidf
db_file=psgs_w100.tsv

# --qa_file: data file with questions
# --tfidf_path: only matters for TFIDF retrieval
# --dpr_model_file, --dense_index_path: only matter for DPR retrieval
# --model_file: path to the reader checkpoint
# --dev_batch_size: 8 is good for one 32GB GPU
# --passages_per_question_predict: 100 for TFIDF, 40 for DPR
# --prediction_results_file: path to save predictions, in a format compatible with the official evaluation script
python3 run_inference.py \
  --qa_file NQ-open.dev.jsonl \
  --retrieval_type "${retrieval_type}" \
  --db_path "${base_dir}/data/wikipedia_split/${db_file}" \
  --tfidf_path "${tfidf_index}" \
  --dpr_model_file "${dpr_retrieval_checkpoint}" \
  --dense_index_path "${dpr_index}" \
  --model_file "${reader_checkpoint}" \
  --dev_batch_size 8 \
  --pretrained_model_cfg bert-base-uncased --encoder_model_type hf_bert --do_lower_case \
  --sequence_length 350 --eval_top_docs 10 20 40 50 80 100 \
  --passages_per_question_predict 100 \
  --prediction_results_file dev_predictions.json
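
Once predictions are saved, they are scored by exact match. The sketch below shows one common way to compute that metric, using SQuAD-style answer normalization (lowercasing, removing punctuation and articles). Treat the normalization as an assumption: the official evaluation script is the authority on how answers are actually compared.

```python
import re
import string

_PUNCTUATION = set(string.punctuation)


def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold_answers):
    """True if the prediction equals any reference answer after normalization."""
    normalized = normalize_answer(prediction)
    return any(normalized == normalize_answer(gold) for gold in gold_answers)


def exact_match_score(predictions, references):
    """Percentage of predictions that match one of their reference answers."""
    hits = sum(exact_match(p, golds) for p, golds in zip(predictions, references))
    return 100.0 * hits / len(predictions)


print(exact_match_score(["The Beatles!", "32"], [["beatles"], ["thirty two"]]))
```

Each question may have several acceptable reference answers, so the references are passed as one list per question.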

Parametric (T5)

This section provides pointers on how to reproduce the experiments on the purely generative approach to “closed-book question answering” with T5, detailed in How Much Knowledge Can You Pack Into the Parameters of a Language Model? (Roberts et al., 2020).

This approach fine-tunes a large language model (T5) to solve QA tasks using only the knowledge stored in its parameters, which it acquires through unsupervised pre-training on a large corpus derived from Common Crawl (C4). Below are the results and disk usage (Docker image size) for various model sizes and configurations.

Model                Exact Match   Disk usage (GB)
T5-1.1-small + SSM   25.8          0.43
T5-1.1-XL + SSM      35.6          5.60
T5-1.1-XXL + SSM     37.9          22.0
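
Since the challenge is about efficiency, one way to read the baseline numbers is to ask which system scores best under a given disk budget. The sketch below does that over the figures reported on this page. The budgets are arbitrary illustrative values, and the comparison is only approximate because the T5 figures are Docker image sizes while the retrieval figures are reported simply as disk usage.

```python
# (exact match, disk usage in GB) for every baseline reported on this page.
systems = {
    "TFIDF-full": (32.0, 20.1),
    "TFIDF-subset": (31.0, 2.8),
    "DPR-full": (41.0, 66.4),
    "DPR-subset": (34.8, 5.9),
    "T5-1.1-small + SSM": (25.8, 0.43),
    "T5-1.1-XL + SSM": (35.6, 5.60),
    "T5-1.1-XXL + SSM": (37.9, 22.0),
}


def best_under_budget(budget_gb):
    """Return the highest-scoring system that fits in `budget_gb`, or None."""
    fitting = {name: em for name, (em, gb) in systems.items() if gb <= budget_gb}
    if not fitting:
        return None
    return max(fitting, key=fitting.get)


for budget in (1, 6, 25, 100):  # illustrative budgets in GB
    print(budget, best_under_budget(budget))
```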


Instructions for fine-tuning the T5 model on Natural Questions via the command line can be found in the T5 Closed Book QA GitHub repo, along with already fine-tuned checkpoints. The repository itself demonstrates how to create new pre-training and fine-tuning tasks with the base T5 library. An example of how to call the T5 repo interactively can be found in the T5 Colab, which also trains on Natural Questions but lacks some of the improvements implemented in the Closed Book QA repo.


While a CPU is sufficient for inference, some variants of T5 require significant compute resources to pre-train or fine-tune. Colab provides a free v2-8 TPU, which is powerful enough to fine-tune the XL/3B model or below. For XXL/11B, you will need a v3-8 TPU or larger.

It is also possible to fine-tune using one or more GPUs following instructions for the TensorFlow implementation or HuggingFace’s PyTorch implementation. Note that the PyTorch implementation does not support model parallelism, and is therefore incompatible with the XXL/11B model.


Running inference over files is demonstrated in the T5 Colab and command line instructions.

Here we demonstrate how to export a fine-tuned checkpoint as a SavedModel, which reduces its size and enables more efficient inference in an interactive setting. You can also run through these commands in the provided Colab notebook.

First, in your shell execute the following commands to export the model:

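# Install t5
pip install -qU t5

# Set MODEL to the name of a fine-tuned checkpoint (see the T5 Closed Book QA repo).
# MODEL="<checkpoint name>"

# Export the model.
# Assumption: --export_dir points at the directory that the Python snippet
# below reads from (EXPORT_DIR).
t5_mesh_transformer \
  --model_dir="gs://t5-data/pretrained_models/cbqa/${MODEL}" \
  --use_model_api \
  --mode="export" \
  --export_dir="/tmp/t5_exports"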

Now that the SavedModel has been exported, we can load it in a Python interpreter for interactive inference:

import os
import tensorflow as tf
import tensorflow_text  # Required to run exported model.

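EXPORT_DIR = '/tmp/t5_exports'

# Pick the most recent export (the subdirectory with the largest name).
saved_model_path = os.path.join(EXPORT_DIR, max(os.listdir(EXPORT_DIR)))

model = tf.saved_model.load(saved_model_path, ["serve"])

def predict_fn(x):
  # Run a batch of input strings through the exported serving signature.
  return model.signatures['serving_default'](tf.constant(x))['outputs'].numpy()

def answer(question):
  # Answer a single question and decode the result from bytes to str.
  return predict_fn([question])[0].decode('utf-8')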
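
The snippet above picks the export with max(os.listdir(EXPORT_DIR)), which works when the directory holds only export subdirectories with names of equal length. A slightly more defensive variant is sketched below. It assumes that each export is written to a subdirectory named by a numeric timestamp, which is the usual TensorFlow export layout but is not stated on this page, and the helper name latest_export is hypothetical.

```python
import os
import tempfile


def latest_export(export_dir):
    """Return the path of the numerically largest timestamped subdirectory."""
    subdirs = [name for name in os.listdir(export_dir)
               if name.isdigit() and os.path.isdir(os.path.join(export_dir, name))]
    if not subdirs:
        raise FileNotFoundError(f"no timestamped exports found in {export_dir}")
    # Compare as integers so that "999" does not sort after "1600000000".
    return os.path.join(export_dir, max(subdirs, key=int))


# Small demonstration with a temporary directory standing in for EXPORT_DIR.
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ("999", "1600000000"):
        os.mkdir(os.path.join(demo_dir, name))
    open(os.path.join(demo_dir, "notes.txt"), "w").close()
    print(os.path.basename(latest_export(demo_dir)))
```

Stray files and shorter directory names are ignored or compared numerically here, whereas a plain string max() would return "notes.txt" in the demonstration directory.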

Now, let’s ask it some questions! Note that we prefix each question with the "nq question:" prompt, since T5 is a multitask model.
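
If questions come from user input or a file, the prefix can be added by a small helper rather than typed by hand. This is a minimal sketch; the helper name format_nq_question and the whitespace cleanup are assumptions, not part of the T5 library.

```python
NQ_PREFIX = "nq question: "


def format_nq_question(question):
    """Collapse whitespace and add the task prefix unless it is already there."""
    question = " ".join(question.split())
    if question.startswith(NQ_PREFIX):
        return question
    return NQ_PREFIX + question


print(format_nq_question("  where is google's headquarters "))
```

The helper is idempotent, so it is safe to apply to inputs that may already carry the prefix.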

for question in ["nq question: where is google's headquarters",
                 "nq question: what is the most populous country in the world",
                 "nq question: name a member of the beatles",
                 "nq question: how many teeth do humans have"]:
  print(answer(question))

COMING SOON: An example of how to package a model checkpoint into a Docker image for submission and serving.