Update 02/10/2020: Please check this link for an improved version of doc2query
Doc2query: Document Expansion by Query Prediction
This repository contains the code to reproduce our entry to the MSMARCO passage ranking task, which was placed first on April 8th, 2019. The paper describing our implementation is here.
MSMARCO Passage Re-Ranking Leaderboard (Apr 8th 2019) | Eval MRR@10 | Dev MRR@10 |
---|---|---|
1st Place - BERTter Indexing (this code) | 36.8 | 37.5 |
2nd Place - SAN + BERT base | 35.9 | 37.0 |
3rd Place - BERT + Small Training | 35.9 | 36.5 |
Installation
We first need to install OpenNMT so we can train a model to predict queries from documents. We clone from 0.8.2 because that is the version we used to train our models. Feel free to use a newer version, but we cannot guarantee that the commands below will work with it.
git clone --branch 0.8.2 https://github.com/OpenNMT/OpenNMT-py.git
cd OpenNMT-py
pip install -r requirements.txt
cd ..
We also need to install Anserini, so we can index and retrieve the expanded documents.
sudo apt-get install maven
git clone https://github.com/castorini/Anserini.git
cd Anserini
mvn clean package appassembler:assemble
tar xvfz eval/trec_eval.9.0.4.tar.gz -C eval/ && cd eval/trec_eval.9.0.4 && make
cd ../ndeval && make
cd ../../../
MS MARCO
Data Preprocessing
First, we need to download and extract the MS MARCO dataset:
DATA_DIR=./msmarco_data
mkdir ${DATA_DIR}
wget https://msmarco.blob.core.windows.net/msmarcoranking/collectionandqueries.tar.gz -P ${DATA_DIR}
tar -xvf ${DATA_DIR}/collectionandqueries.tar.gz -C ${DATA_DIR}
To confirm, collectionandqueries.tar.gz should have an MD5 checksum of fed5aa512935c7b62787cb68ac9597d6.
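If the md5sum command-line tool is not available, the checksum can also be verified from Python. The following is a minimal sketch using only the standard library; the function names are illustrative and not part of this repository.

```python
import hashlib

# Checksum quoted above for collectionandqueries.tar.gz.
EXPECTED_MD5 = "fed5aa512935c7b62787cb68ac9597d6"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_is_intact(path, expected=EXPECTED_MD5):
    """True when the file's MD5 digest matches the expected checksum."""
    return md5_of_file(path) == expected
```

Calling archive_is_intact("./msmarco_data/collectionandqueries.tar.gz") should return True for a complete download.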
The script below converts the data to a format that can be consumed by the OpenNMT training and inference scripts:
python ./convert_msmarco_to_opennmt.py \
--collection_path=${DATA_DIR}/collection.tsv \
--train_queries=${DATA_DIR}/queries.train.tsv \
--train_qrels=${DATA_DIR}/qrels.train.tsv \
--dev_queries=${DATA_DIR}/queries.dev.tsv \
--dev_qrels=${DATA_DIR}/qrels.dev.small.tsv \
--output_folder=${DATA_DIR}/opennmt_format
The output files and their number of lines should be:
$ wc -l ./msmarco_data/opennmt_format/*
8841823 ./msmarco_data/opennmt_format/src-collection.txt
7437 ./msmarco_data/opennmt_format/src-dev.txt
532751 ./msmarco_data/opennmt_format/src-train.txt
7437 ./msmarco_data/opennmt_format/tgt-dev.txt
532751 ./msmarco_data/opennmt_format/tgt-train.txt
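The same check can be automated. The sketch below compares each file's line count with the numbers listed above; the helper names are hypothetical and the script is not part of this repository.

```python
import os

# Expected line counts, copied from the wc -l listing above.
EXPECTED_LINES = {
    "src-collection.txt": 8841823,
    "src-dev.txt": 7437,
    "src-train.txt": 532751,
    "tgt-dev.txt": 7437,
    "tgt-train.txt": 532751,
}


def count_lines(path):
    """Count newline characters in a file without loading it into memory."""
    total = 0
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(1 << 20), b""):
            total += block.count(b"\n")
    return total


def find_mismatches(folder, expected=EXPECTED_LINES):
    """Return {filename: (expected, actual)} for files whose counts differ.

    A missing file is reported with an actual count of None.
    """
    mismatches = {}
    for name, want in expected.items():
        path = os.path.join(folder, name)
        got = count_lines(path) if os.path.exists(path) else None
        if got != want:
            mismatches[name] = (want, got)
    return mismatches
```

An empty dictionary from find_mismatches("./msmarco_data/opennmt_format") means all five files have the expected size.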
The last step is to preprocess train and dev files with the following command:
python ./OpenNMT-py/preprocess.py \
-train_src ${DATA_DIR}/opennmt_format/src-train.txt \
-train_tgt ${DATA_DIR}/opennmt_format/tgt-train.txt \
-valid_src ${DATA_DIR}/opennmt_format/src-dev.txt \
-valid_tgt ${DATA_DIR}/opennmt_format/tgt-dev.txt \
-save_data ${DATA_DIR}/opennmt_format/preprocessed \
-src_seq_length 10000 \
-tgt_seq_length 10000 \
-src_seq_length_trunc 400 \
-tgt_seq_length_trunc 100 \
-dynamic_dict \
-share_vocab \
-src_vocab_size 32000 \
-tgt_vocab_size 32000 \
-shard_size 100000
Training doc2query (i.e. a transformer model)
python -u ./OpenNMT-py/train.py \
-data ${DATA_DIR}/opennmt_format/preprocessed \
-save_model ${DATA_DIR}/doc2query \
-layers 6 \
-rnn_size 512 \
-word_vec_size 512 \
-transformer_ff 2048 \
-heads 8 \
-encoder_type transformer \
-decoder_type transformer \
-position_encoding \
-train_steps 10000 \
-max_generator_batches 2 \
-dropout 0.1 \
-batch_size 4096 \
-batch_type tokens \
-normalization tokens \
-accum_count 2 \
-optim adam \
-adam_beta2 0.998 \
-decay_method noam \
-warmup_steps 8000 \
-learning_rate 2.0 \
-max_grad_norm 0.0 \
-param_init 0.0 \
-param_init_glorot \
-label_smoothing 0.1 \
-valid_steps 5000 \
-save_checkpoint_steps 5000 \
-world_size 4 \
-share_embeddings \
-gpu_ranks 0 1 2 3
The command above starts training a transformer model using four GPUs (you can train with one GPU by setting -gpu_ranks 0 and -world_size 1). It should take approximately 3-6 hours to reach iteration 10,000, which yields the lowest perplexity on the dev set (~15.2).
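To see how -decay_method noam, -warmup_steps 8000 and -learning_rate 2.0 interact, here is a minimal sketch of the Noam schedule from "Attention Is All You Need". It assumes that OpenNMT-py multiplies the standard formula by the -learning_rate value and uses -rnn_size as the model dimension; check the OpenNMT-py source for the exact behaviour of your version.

```python
def noam_learning_rate(step, base_lr=2.0, model_size=512, warmup_steps=8000):
    """Learning rate at a training step under the Noam schedule.

    The rate grows linearly during warmup, peaks at `warmup_steps`,
    then decays proportionally to 1 / sqrt(step).
    Defaults mirror the flags in the training command above.
    """
    if step < 1:
        raise ValueError("step must be >= 1")
    scale = min(step ** -0.5, step * warmup_steps ** -1.5)
    return base_lr * model_size ** -0.5 * scale


for step in (1000, 8000, 10000):
    print(step, noam_learning_rate(step))
```

Under these assumptions the learning rate peaks at step 8,000 at just under 1e-3, so most of the 10,000 training steps are spent in the warmup phase.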
We can now evaluate the model's performance in BLEU points:
python ./OpenNMT-py/translate.py \
-gpu 0 \
-model ${DATA_DIR}/doc2query_step_10000.pt \
-src ${DATA_DIR}/opennmt_format/src-dev.txt \
-tgt ${DATA_DIR}/opennmt_format/tgt-dev.txt \
-output ${DATA_DIR}/opennmt_format/pred-dev.txt \
-replace_unk \
-verbose \
-report_time \
-beam_size 1
perl ./OpenNMT-py/tools/multi-bleu.perl \
${DATA_DIR}/opennmt_format/tgt-dev.txt < ${DATA_DIR}/opennmt_format/pred-dev.txt
The output should be similar to this:
BLEU = 8.82, 35.0/14.3/5.7/2.5 (BP=0.957, ratio=0.958, hyp_len=34050, ref_len=35553)
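The fields in this line can be cross-checked by hand: BLEU is the geometric mean of the four n-gram precisions multiplied by the brevity penalty (BP). The sketch below recomputes both values from the numbers printed above. It is an illustration of the formula, not a replacement for multi-bleu.perl.

```python
import math


def brevity_penalty(hyp_len, ref_len):
    """Penalty applied when the hypotheses are shorter than the references."""
    if hyp_len >= ref_len:
        return 1.0
    return math.exp(1.0 - ref_len / hyp_len)


def bleu_from_components(precisions_pct, hyp_len, ref_len):
    """BLEU (0-100) from n-gram precisions given in percent and corpus lengths."""
    log_mean = sum(math.log(p / 100.0) for p in precisions_pct) / len(precisions_pct)
    return 100.0 * brevity_penalty(hyp_len, ref_len) * math.exp(log_mean)


bp = brevity_penalty(34050, 35553)
bleu = bleu_from_components([35.0, 14.3, 5.7, 2.5], 34050, 35553)
print(f"BP={bp:.3f} BLEU={bleu:.2f}")
```

The recomputed BLEU lands within a few hundredths of the reported 8.82; the small gap comes from the precisions being rounded to one decimal place in the output line.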
In case you don't want to train a doc2query model yourself, you can download our trained model here.
Predicting Queries
We use our best model checkpoint (iteration 10,000) to predict 5 queries for each document in the collection:
python ./OpenNMT-py/translate.py \
-gpu 0 \
-model ${DATA_DIR}/doc2query_step_10000.pt \
-src ${DATA_DIR}/opennmt_format/src-collection.txt \
-output ${DATA_DIR}/opennmt_format/pred-collection_beam5.txt \
-batch_size 32 \
-beam_size 5 \
--n_best 5 \
-replace_unk \
-report_time
The step above takes many hours. Alternatively, you can split src-collection.txt into multiple files, process them in parallel and merge the output. For example:
# Split into 9 files, each with 1M docs (the last one has fewer).
split -l 1000000 --numeric-suffixes ${DATA_DIR}/opennmt_format/src-collection.txt ${DATA_DIR}/opennmt_format/src-collection.txt
...
# Execute 9 translate.py in parallel, one for each split file.
...
# Merge the predictions into a single file.
cat ${DATA_DIR}/opennmt_format/pred-collection_beam5.txt?? > ${DATA_DIR}/opennmt_format/pred-collection_beam5.txt
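The merge only works if the shard outputs are concatenated in the same order as the input shards, which the ?? glob provides because the zero-padded numeric suffixes sort lexicographically. The sketch below illustrates that invariant on in-memory lists; the helper names are hypothetical, and the parallel translate.py invocations themselves are not shown.

```python
def split_into_shards(lines, shard_size):
    """Split a list of lines into consecutive shards of at most shard_size lines."""
    return [lines[i:i + shard_size] for i in range(0, len(lines), shard_size)]


def shard_suffixes(num_shards, width=2):
    """Zero-padded numeric suffixes ("00", "01", ...), as used by split --numeric-suffixes."""
    return [str(i).zfill(width) for i in range(num_shards)]


def merge_shards(shards_by_suffix):
    """Concatenate shard outputs in suffix order so lines stay aligned with the input."""
    merged = []
    for suffix in sorted(shards_by_suffix):
        merged.extend(shards_by_suffix[suffix])
    return merged


# Toy example: 25 "documents" in shards of 10.
docs = [f"doc {i}" for i in range(25)]
shards = split_into_shards(docs, 10)
by_suffix = dict(zip(shard_suffixes(len(shards)), shards))
assert merge_shards(by_suffix) == docs
```

With 8,841,823 documents and a shard size of 1,000,000, this scheme yields 9 shards with suffixes 00 through 08.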
In any case, you can download the predicted queries here.
Expanding docs
Next, we need to merge the original documents with the predicted queries into Anserini's jsonl files (which have one json object per line):
python ./convert_collection_to_jsonl.py \
--collection_path=${DATA_DIR}/collection.tsv \
--predictions=${DATA_DIR}/opennmt_format/pred-collection_beam5.txt \
--beam_size=5 \
--output_folder=${DATA_DIR}/collection_jsonl
The above script should generate 9 jsonl files in ${DATA_DIR}/collection_jsonl, each with 1M lines/docs (except for the last one, which should have 841,823 lines).
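Conceptually, the expansion appends the predicted queries to each passage. The following is a minimal sketch of that idea, not the actual convert_collection_to_jsonl.py. It assumes that collection.tsv lines have the form "doc_id<TAB>text", that the predictions file holds beam_size consecutive lines per document in collection order, and that each JSON object carries "id" and "contents" fields.

```python
import json


def expand_documents(collection_lines, prediction_lines, beam_size):
    """Yield one JSON string per document: original text followed by its predicted queries.

    Assumes beam_size consecutive predictions per document, in collection order.
    """
    if len(prediction_lines) != beam_size * len(collection_lines):
        raise ValueError("predictions do not line up with the collection")
    for index, line in enumerate(collection_lines):
        doc_id, text = line.rstrip("\n").split("\t", 1)
        start = index * beam_size
        queries = [q.strip() for q in prediction_lines[start:start + beam_size]]
        contents = " ".join([text] + queries)
        yield json.dumps({"id": doc_id, "contents": contents})


# Toy example with two passages and two predicted queries each.
collection = ["0\tthe cat sat\n", "1\tdogs bark\n"]
predictions = ["what cat\n", "where cat\n", "why bark\n", "do dogs bark\n"]
records = [json.loads(r) for r in expand_documents(collection, predictions, beam_size=2)]
```

The length check at the top guards against the most likely failure mode: a predictions file that was merged out of order or is missing lines.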
We can now index these docs as a JsonCollection using Anserini:
sh ./Anserini/target/appassembler/bin/IndexCollection -collection JsonCollection \
-generator DefaultLuceneDocumentGenerator -threads 9 -input ${DATA_DIR}/collection_jsonl \
-index ${DATA_DIR}/lucene-index-msmarco -optimize
The output message should be something like this:
2019-04-26 07:49:14,549 INFO [main] index.IndexCollection (IndexCollection.java:647) - Total 8,841,823 documents indexed in 00:06:02
Your speed may vary: on a modern desktop machine with an SSD, indexing takes around a minute.
Retrieving and Evaluating the Dev set
Since the dev set contains too many queries (100k+), it would take a long time to retrieve all of them. To speed this up, we use only the queries that are in the qrels file:
python ./Anserini/src/main/python/msmarco/filter_queries.py --qrels=${DATA_DIR}/qrels.dev.small.tsv \
--queries=${DATA_DIR}/queries.dev.tsv --output_queries=${DATA_DIR}/queries.dev.small.tsv
The output queries file should contain 6980 lines.
$ wc -l ${DATA_DIR}/queries.dev.small.tsv
6980 /scratch/rfn216/msmarco_data//queries.dev.small.tsv
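The filtering itself is simple. Here is a minimal sketch of the idea, assuming both files are tab-separated with the query id in the first column; it is an illustration and not the code of Anserini's filter_queries.py.

```python
def filter_queries(query_lines, qrels_lines):
    """Keep only the query lines whose id appears in the qrels.

    Assumes tab-separated lines with the query id in the first column.
    """
    judged_ids = {
        line.split("\t", 1)[0].strip() for line in qrels_lines if line.strip()
    }
    return [
        line for line in query_lines
        if line.split("\t", 1)[0].strip() in judged_ids
    ]


# Toy example: only queries 2 and 3 have relevance judgments.
queries = ["1\twhat is bm25\n", "2\thow do tpus work\n", "3\twho wrote bert\n"]
qrels = ["2\t0\t100\t1\n", "3\t0\t200\t1\n", "3\t0\t201\t1\n"]
kept = filter_queries(queries, qrels)
```

Using a set for the judged ids keeps the lookup cheap even with 100k+ queries.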
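We can now retrieve this smaller set of queries. The command runs from inside the Anserini folder, so ${DATA_DIR} must be an absolute path for it to resolve correctly: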
cd Anserini
python ./src/main/python/msmarco/retrieve.py --index ${DATA_DIR}/lucene-index-msmarco \
--qid_queries ${DATA_DIR}/queries.dev.small.tsv --output ${DATA_DIR}/run.dev.small.tsv --hits 1000
cd ..
Retrieval speed will vary by machine:
On a modern desktop with an SSD, we can get ~0.04 seconds per query (taking about five minutes).
On a slower machine with mechanical disks, the entire process might take as long as a couple of hours.
The option --hits specifies the number of documents to be retrieved per query. Thus, the output file should have approximately 6980 * 1000 = 6.98M lines.
Finally, we can evaluate the retrieved documents using the official MS MARCO evaluation script:
python ./Anserini/src/main/python/msmarco/msmarco_eval.py ${DATA_DIR}/qrels.dev.small.tsv ${DATA_DIR}/run.dev.small.tsv
And the output should be like this:
#####################
MRR @10: 0.22155540774093688
QueriesRanked: 6980
#####################
Note that these results are about 0.6 points higher in MRR@10 (on the 0-100 scale) than the ones in the paper. This is due to better BM25 tuning (b1=0.8, k=0.6).
In case you want to compare your retrieved docs against ours, you can download our retrieved docs here.
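For reference, the MRR@10 metric used in this section can be sketched as follows. This is a simplified illustration that operates on in-memory dictionaries; the official msmarco_eval.py handles file parsing and validation and may differ in details.

```python
def mrr_at_k(run, qrels, k=10):
    """Mean reciprocal rank of the first relevant document within the top k.

    run:   {query_id: [doc_id, ...]} ordered best-first
    qrels: {query_id: set of relevant doc_ids}
    Queries with no relevant document in the top k contribute 0.
    """
    if not qrels:
        raise ValueError("qrels must not be empty")
    total = 0.0
    for query_id, relevant in qrels.items():
        for rank, doc_id in enumerate(run.get(query_id, [])[:k], start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(qrels)


# Toy example: relevant doc at rank 2, rank 1, and not retrieved at all.
run = {"q1": ["a", "b"], "q2": ["c", "d", "e"], "q3": ["x"]}
qrels = {"q1": {"b"}, "q2": {"c"}, "q3": {"z"}}
score = mrr_at_k(run, qrels)  # (1/2 + 1/1 + 0) / 3
```

Because only the first relevant hit in the top 10 counts, retrieving 1000 documents per query matters for the re-ranking stage rather than for this metric.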
Reranking with BERT
Most of the gains come from using BERT to re-rank the passages retrieved with BM25 + Doc2query. To implement the BERT re-ranker, we follow the same procedure described in the BERT for Passage Re-ranking repository.
We first need to convert dev queries and retrieved docs into the TFRecord format that will be consumed by BERT:
python convert_msmarco_to_tfrecord.py \
--output_folder=${DATA_DIR}/bert_tfrecord \
--collection_path=${DATA_DIR}/collection.tsv \
--vocab=vocab.txt \
--queries=${DATA_DIR}/queries.dev.small.tsv \
--run=${DATA_DIR}/run.dev.small.tsv \
--qrels=${DATA_DIR}/qrels.dev.small.tsv
The script above produces the files dataset.tf and query_doc_ids.txt, which should be moved to a folder in Google Cloud Storage. For your convenience, you can download these files here.
We are now ready to use our Google Colab notebook to re-rank with BERT.
Because we did not see any difference between training BERT with the expanded docs and with the original docs, we simply re-rank dev queries using the same checkpoint from the BERT for Passage Re-ranking repository; that is, no training is required in this step.
The Colab is configured to use TPUs, and it should take 5-10 hours to re-rank all 6980 dev set queries. If you use a GPU, expect this step to take about 10x longer.
After it finishes, we can download the run file msmarco_predictions_dev.tsv (which is in the Google Storage folder you specified in OUTPUT_DIR) and evaluate it:
python ./Anserini/src/main/python/msmarco/msmarco_eval.py ${DATA_DIR}/qrels.dev.small.tsv ${DATA_DIR}/msmarco_predictions_dev.tsv
The output should be like this:
#####################
MRR @10: 0.3763750170555333
QueriesRanked: 6980
#####################
Note that this MRR@10 is slightly higher than our leaderboard entry, probably because of the better-tuned BM25.
You can download our run file here.
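The re-ranking step in this section boils down to scoring every (query, passage) pair and sorting each query's candidates by that score. The sketch below shows only this outer loop: score_fn is a hypothetical stand-in for the BERT relevance model, which is not implemented here, and the "query_id<TAB>doc_id<TAB>rank" output layout is an assumption about the run file format.

```python
def rerank(candidates, score_fn, top_k=1000):
    """Re-order each query's candidates by descending relevance score.

    candidates: {query_id: [(doc_id, passage_text), ...]} from first-stage retrieval
    score_fn:   callable(query_id, passage_text) -> float (stand-in for BERT)
    Returns {query_id: [doc_id, ...]} ordered best-first.
    """
    reranked = {}
    for query_id, docs in candidates.items():
        scored = sorted(docs, key=lambda doc: score_fn(query_id, doc[1]), reverse=True)
        reranked[query_id] = [doc_id for doc_id, _ in scored[:top_k]]
    return reranked


def to_run_lines(reranked):
    """Format results as "query_id<TAB>doc_id<TAB>rank" lines (assumed layout)."""
    return [
        f"{query_id}\t{doc_id}\t{rank}"
        for query_id, docs in reranked.items()
        for rank, doc_id in enumerate(docs, start=1)
    ]


# Toy example: a dummy scorer that prefers longer passages.
candidates = {"q1": [("d1", "short"), ("d2", "a much longer passage"), ("d3", "medium one")]}
result = rerank(candidates, lambda query_id, text: float(len(text)))
```

Because the first-stage candidates come from the expanded index while BERT scores the original passage text, the two stages can be improved independently.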
TREC-CAR
Download our doc2query model trained on TREC-CAR here.
How do I cite this work?
@article{nogueira2019document,
title={Document Expansion by Query Prediction},
author={Nogueira, Rodrigo and Yang, Wei and Lin, Jimmy and Cho, Kyunghyun},
journal={arXiv preprint arXiv:1904.08375},
year={2019}
}