SIGIR19-BERT-IR: Deeper Text Understanding for IR with Contextual Neural Language Modeling
Code and data for the SIGIR-19 short paper "Deeper Text Understanding for IR with Contextual Neural Language Modeling".
Abstract: Neural networks provide new possibilities to automatically learn complex language patterns and query-document relations. Neural IR models have achieved promising results in learning query-document relevance patterns, but few explorations have been done on understanding the text content of a query or a document. This paper studies leveraging a recently-proposed contextual neural language model, BERT, to provide deeper text understanding for IR. Experimental results demonstrate that the contextual text representations from BERT are more effective than traditional word embeddings. Compared to bag-of-words retrieval models, the contextual language model can better leverage language structures, bringing large improvements on queries written in natural languages. Combining the text understanding ability with search knowledge leads to an enhanced pre-trained BERT model that can benefit related search tasks where training data are limited.
Nov 27, 2019: We received several very good questions in 'Issues'. Check there for information about data preprocessing/post-processing! -Zhuyun
Data
Data can be downloaded from our Virtual Appendix.
The input to the BERT re-ranker is a list of .trec.with_json files. Each line is in the form of:
qid Q0 docid rank score runname # {"doc":{"title":"...", "body":"......"}}
E.g. A document:
80 Q0 clueweb09-en0008-49-09144 1 -5.66498569 indri # {"doc": {"title": "Personal Keyboards reviews - Keyboard-Reviews.com", "body": "personal keyboards reviews accessories bass guitars , a..."}}
A passage:
80 Q0 clueweb09-en0008-49-09144_passage-0 1 -5.66498569 passage # {"doc": {"title": "Personal Keyboards reviews - Keyboard-Reviews.com", "body": "personal keyboards reviews..."}}
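A minimal sketch of reading one such line in Python, assuming the TREC fields and the JSON payload are separated by the first `#` on the line. The `RankedDoc` type and the `parse_trec_with_json` function are hypothetical names used for illustration; they are not part of this repo.

```python
import json
from typing import NamedTuple


class RankedDoc(NamedTuple):
    """One line of a .trec.with_json file (hypothetical container)."""
    qid: str
    docid: str
    rank: int
    score: float
    runname: str
    title: str
    body: str


def parse_trec_with_json(line: str) -> RankedDoc:
    # Split on the first '#' only: the document text may itself contain '#'.
    trec_part, json_part = line.split("#", 1)
    qid, _q0, docid, rank, score, runname = trec_part.split()
    doc = json.loads(json_part)["doc"]
    return RankedDoc(
        qid=qid,
        docid=docid,
        rank=int(rank),
        score=float(score),
        runname=runname,
        title=doc.get("title", ""),
        body=doc.get("body", ""),
    )
```

Calling `parse_trec_with_json` on a well-formed line returns the six TREC fields plus the title and body as plain Python values.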
We release these .trec.with_json files for ClueWeb09-B. We cannot release the contents of the Robust04 documents, but we provide a small sample of a Robust04 .trec.with_json file. As an alternative, we provide the initial rankings for ClueWeb09/Robust04 (.trec files). Each line is in the format:
qid Q0 docid rank score runname
You need to get the text contents of the candidate documents and append them to each line of the .trec file in JSON format ({"doc":{"title":"...", "body":"......"}}).
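One way to append the document text, sketched under the assumption that you already have the title and body of each candidate document in memory. The `attach_documents` function and the `doc_texts` mapping are hypothetical; how you obtain the text (for example from your own index of the collection) is up to you.

```python
import json
from typing import Dict, Iterable, Iterator, Tuple


def attach_documents(
    trec_lines: Iterable[str],
    doc_texts: Dict[str, Tuple[str, str]],
) -> Iterator[str]:
    """Yield .trec.with_json lines built from plain .trec lines.

    doc_texts maps docid -> (title, body); it is an assumed input,
    not something this repo provides.
    """
    for line in trec_lines:
        line = line.strip()
        if not line:
            continue
        docid = line.split()[2]  # fields: qid Q0 docid rank score runname
        if docid not in doc_texts:
            # Assumption: silently skip documents whose text is unavailable.
            continue
        title, body = doc_texts[docid]
        payload = json.dumps({"doc": {"title": title, "body": body}})
        yield f"{line} # {payload}"
```

Using `json.dumps` rather than string concatenation keeps quotes and backslashes in the document text properly escaped.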
Once you have generated the .trec.with_json files for documents, you can use the provided passage generation script to generate passages.
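For orientation, the sketch below shows what a passage generation step could look like. It is not the provided script: it assumes overlapping word windows over the body, and the `window` and `stride` defaults are illustrative assumptions. Only the `_passage-N` docid suffix and the `passage` run name are taken from the example line above; `split_into_passages` is a hypothetical name.

```python
import json
from typing import Iterator


def split_into_passages(line: str, window: int = 150, stride: int = 75) -> Iterator[str]:
    """Turn one document-level line into passage-level lines.

    window and stride are counted in whitespace-separated words and are
    assumed values, not settings read from the provided script.
    """
    trec_part, json_part = line.split("#", 1)
    qid, q0, docid, rank, score, _runname = trec_part.split()
    doc = json.loads(json_part)["doc"]
    title = doc.get("title", "")
    words = doc.get("body", "").split()

    start, index = 0, 0
    while True:
        chunk = " ".join(words[start:start + window])
        payload = json.dumps({"doc": {"title": title, "body": chunk}})
        # Each passage keeps the rank and score of its source document.
        yield f"{qid} {q0} {docid}_passage-{index} {rank} {score} passage # {payload}"
        if start + window >= len(words):
            break  # this window already reached the end of the body
        start += stride
        index += 1
```

A document shorter than one window yields a single passage, so every document keeps at least one candidate.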
Google Colab notebooks to train BERT
You can upload the .trec.with_json files to a Google Cloud Storage bucket and run the notebooks directly:
- ClueWeb09-B Document Level Train/Inference (BERT-FirstP)
- ClueWeb09-B Passage Level Train/Inference (BERT-MaxP, BERT-SumP)
- Robust04 Document Level Train/Inference (BERT-FirstP)
- Robust04 Passage Level Train/Inference (BERT-MaxP, BERT-SumP)
The output is a file of scores, one for each document/passage. It needs to be aligned with the document/passage ids in the original .trec.with_json file. We provide scripts for this purpose.
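The provided scripts are the reference for this step. As a rough illustration only, the sketch below assumes the score file holds one score per line in the same order as the .trec.with_json file, and that passage ids end in `_passage-N`. The function names are hypothetical, and the three modes follow the usual reading of the model names: first passage (FirstP), best passage (MaxP), and sum over passages (SumP).

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def align_scores(trec_lines: List[str], scores: List[float]) -> List[Tuple[str, str, float]]:
    """Pair each score with the (qid, docid) of the line at the same position."""
    if len(trec_lines) != len(scores):
        raise ValueError("number of scores does not match number of input lines")
    aligned = []
    for line, score in zip(trec_lines, scores):
        qid, _q0, docid = line.split()[:3]
        aligned.append((qid, docid, score))
    return aligned


def aggregate_passages(
    aligned: List[Tuple[str, str, float]], mode: str = "max"
) -> Dict[Tuple[str, str], float]:
    """Combine passage scores into one score per (qid, document)."""
    grouped: Dict[Tuple[str, str], List[Tuple[int, float]]] = defaultdict(list)
    for qid, pid, score in aligned:
        docid, sep, idx = pid.rpartition("_passage-")
        if not sep:
            # Not a passage id: treat the whole id as a single-passage document.
            docid, idx = pid, "0"
        grouped[(qid, docid)].append((int(idx), score))

    result = {}
    for key, items in grouped.items():
        items.sort()  # order passages by their index within the document
        values = [s for _, s in items]
        if mode == "first":
            result[key] = values[0]
        elif mode == "max":
            result[key] = max(values)
        elif mode == "sum":
            result[key] = sum(values)
        else:
            raise ValueError(f"unknown mode: {mode}")
    return result
```

Sorting the per-query results by the aggregated score then gives the re-ranked list in the usual `qid Q0 docid rank score runname` format.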
Pre-trained Bing-augmented BERT Model
Some search tasks require both general text understanding (e.g. Honda is a motor company) and more specific search knowledge (e.g. people want to see special offers about Honda). While pre-trained BERT encodes general language patterns, the search knowledge must be learned from labeled search data. We follow the domain adaptation setting from our WSDM2018 Conv-KNRM work and augment BERT with search knowledge from a sample of the Bing search log.
The Bing-augmented BERT model can be downloaded from our Virtual Appendix.