Dense Retrieval Papers

A collection of papers related to dense retrieval.

The papers are organized following the structure of our survey "Dense Text Retrieval based on Pretrained Language Models: A Survey".

If you find our survey useful for your research, please cite the following paper:

@article{DRSurvey,
    title={Dense Text Retrieval based on Pretrained Language Models: A Survey},
    author={Wayne Xin Zhao and Jing Liu and Ruiyang Ren and Ji-Rong Wen},
    year={2022},
    journal={arXiv preprint arXiv:2211.14876}
}

Survey Paper
Architecture
Training
Indexing
Integration with Re-ranking
Advanced Topics
Applications
Datasets
Libraries

Survey Paper

Paper	Author	Venue	Code
Pretrained Transformers for Text Ranking: BERT and Beyond.	Jimmy Lin et al.	Synthesis HLT 2021	NA
Semantic Models for the First-stage Retrieval: A Comprehensive Review.	Yinqiong Cai et al.	Arxiv 2021	NA
Pre-training Methods in Information Retrieval.	Yixing Fan et al.	Arxiv 2021	NA
A Deep Look into Neural Ranking Models for Information Retrieval.	Jiafeng Guo et al.	Inf. Process. Manag. 2020	NA
Lecture Notes on Neural Information Retrieval.	Nicola Tonellotto.	Arxiv 2022	NA
Low-Resource Dense Retrieval for Open-Domain Question Answering: A Comprehensive Survey.	Xiaoyu Shen et al.	Arxiv 2022	NA

Architecture

Paper	Author	Venue	Code
Poly-encoders: Architectures and pre-training strategies for fast and accurate multi-sentence scoring.	Samuel Humeau et al.	ICLR 2020	Python
Sparse, Dense, and Attentional Representations for Text Retrieval.	Yi Luan et al.	TACL 2021	Python
ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT.	Omar Khattab et al.	SIGIR 2020	Python
Query Embedding Pruning for Dense Retrieval.	Nicola Tonellotto et al.	CIKM 2021	Python
Context-Aware Term Weighting For First Stage Passage Retrieval.	Zhuyun Dai et al.	SIGIR 2020	Python
Context-Aware Document Term Weighting for Ad-Hoc Search.	Zhuyun Dai et al.	WWW 2020	Python
DC-BERT: Decoupling Question and Document for Efficient Contextual Encoding.	Yuyu Zhang et al.	SIGIR 2020	NA
Real-Time Open-Domain Question Answering with Dense-Sparse Phrase Index.	Minjoon Seo et al.	ACL 2019	Python
Learning Dense Representations of Phrases at Scale.	Jinhyuk Lee et al.	ACL 2021	Python
Phrase Retrieval Learns Passage Retrieval, Too.	Jinhyuk Lee et al.	EMNLP 2021	Python
Dense Hierarchical Retrieval for Open-Domain Question Answering.	Ye Liu et al.	EMNLP 2021	Python
The Curse of Dense Low-Dimensional Information Retrieval for Large Index Sizes.	Nils Reimers et al.	ACL 2021	NA
Predicting Efficiency/Effectiveness Trade-offs for Dense vs. Sparse Retrieval Strategy Selection.	Negar Arabzadeh et al.	CIKM 2021	Python
Boosted Dense Retriever.	Patrick Lewis et al.	Arxiv 2021	NA
PARM: A Paragraph Aggregation Retrieval Model for Dense Document-to-Document Retrieval.	Sophia Althammer et al.	ECIR 2022	Python
Sparsifying Sparse Representations for Passage Retrieval by Top-k Masking.	Jheng-Hong Yang et al.	Arxiv 2021	NA
Improving Document Representations by Generating Pseudo Query Embeddings for Dense Retrieval.	Hongyin Tang et al.	ACL 2021	NA
ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction.	Keshav Santhanam et al.	Arxiv 2021	Python
GNN-encoder: Learning a Dual-encoder Architecture via Graph Neural Networks for Dense Passage Retrieval.	Jiduan Liu et al.	Arxiv 2022	NA
Sentence-aware Contrastive Learning for Open-Domain Passage Retrieval.	Bohong Wu et al.	ACL 2022	Python
Aggretriever: A Simple Approach to Aggregate Textual Representation for Robust Dense Passage Retrieval.	Sheng-Chieh Lin et al.	Arxiv 2022	Python
DPTDR: Deep Prompt Tuning for Dense Passage Retrieval.	Zhengyang Tang et al.	Arxiv 2022	Python
LED: Lexicon-Enlightened Dense Retriever for Large-Scale Retrieval.	Kai Zhang et al.	Arxiv 2022	NA
Task-Aware Specialization for Efficient and Robust Dense Retrieval for Open-Domain Question Answering.	Hao Cheng et al.	Arxiv 2022	NA
COIL: Revisit Exact Lexical Match in Information Retrieval with Contextualized Inverted List.	Luyu Gao et al.	NAACL 2021	Python
A Few Brief Notes on DeepImpact, COIL, and a Conceptual Framework for Information Retrieval Techniques.	Jimmy Lin et al.	Arxiv 2021	NA
Pseudo Relevance Feedback with Deep Language Models and Dense Retrievers: Successes and Pitfalls.	Hang Li et al.	Arxiv 2021	NA
Improving Query Representations for Dense Retrieval with Pseudo Relevance Feedback.	HongChien Yu et al.	CIKM 2021	Python
Pseudo-Relevance Feedback for Multiple Representation Dense Retrieval.	Xiao Wang et al.	SIGIR 2021	NA
Improving Query Representations for Dense Retrieval with Pseudo Relevance Feedback: A Reproducibility Study.	Hang Li et al.	Arxiv 2021	NA
Implicit Feedback for Dense Passage Retrieval: A Counterfactual Approach.	Shengyao Zhuang et al.	Arxiv 2022	Python
Parameter-Efficient Prompt Tuning Makes Generalized and Calibrated Neural Text Retrievers.	Weng Lam Tam et al.	Arxiv 2022	Python
Densifying Sparse Representations for Passage Retrieval by Representational Slicing.	Sheng-Chieh Lin et al.	Arxiv 2021	NA
SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking.	Thibault Formal et al.	SIGIR 2021	Python
SPLADE v2: Sparse Lexical and Expansion Model for Information Retrieval.	Thibault Formal et al.	Arxiv 2021	Python
BERT-based Dense Retrievers Require Interpolation with BM25 for Effective Passage Retrieval.	Shuai Wang et al.	ICTIR 2021	Python
A White Box Analysis of ColBERT.	Thibault Formal et al.	ECIR 2021	NA
Towards Axiomatic Explanations for Neural Ranking Models.	Michael Völske et al.	ICTIR 2021	Python
ABNIRML: Analyzing the Behavior of Neural IR Models.	Sean MacAvaney et al.	Arxiv 2020	Python

Training

Formulation

Paper	Author	Venue	Code
More Robust Dense Retrieval with Contrastive Dual Learning.	Yizhi Li et al.	ICTIR 2021	Python
PAIR: Leveraging Passage-Centric Similarity Relation for Improving Dense Passage Retrieval.	Ruiyang Ren et al.	ACL 2021	Python
xMoCo: Cross Momentum Contrastive Learning for Open-Domain Question Answering.	Nan Yang et al.	ACL 2021	NA
A Modern Perspective on Query Likelihood with Deep Generative Retrieval Models.	Oleg Lesota et al.	ICTIR 2021	Python
Learning Diverse Document Representations with Deep Query Interactions for Dense Retrieval.	Zehan Li et al.	Arxiv 2022	Python
Shallow pooling for sparse labels.	Negar Arabzadeh et al.	Arxiv 2021	NA
Hard Negatives or False Negatives: Correcting Pooling Bias in Training Neural Ranking Models.	Yinqiong Cai et al.	Arxiv 2022	NA
Debiased Contrastive Learning of Unsupervised Sentence Representations.	Kun Zhou et al.	ACL 2022	NA

Negative Selection

Paper	Author	Venue	Code
Learning To Retrieve: How to Train a Dense Retrieval Model Effectively and Efficiently.	Jingtao Zhan et al.	Arxiv 2020	NA
Dense Passage Retrieval for Open-Domain Question Answering.	Vladimir Karpukhin et al.	EMNLP 2020	Python
RepBERT: Contextualized Text Embeddings for First-Stage Retrieval.	Jingtao Zhan et al.	Arxiv 2020	Python
Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval.	Lee Xiong et al.	ICLR 2021	Python
Optimizing Dense Retrieval Model Training with Hard Negatives.	Jingtao Zhan et al.	SIGIR 2021	Python
Neural Passage Retrieval with Improved Negative Contrast.	Jing Lu et al.	Arxiv 2020	NA
RocketQA: An Optimized Training Approach to Dense Passage Retrieval for Open-Domain Question Answering.	Yingqi Qu et al.	NAACL 2021	Python
Efficiently Teaching an Effective Dense Retriever with Balanced Topic Aware Sampling.	Sebastian Hofstätter et al.	SIGIR 2021	Python
Scaling deep contrastive learning batch size under memory limited setup.	Luyu Gao et al.	RepL4NLP 2021	Python
Multi-stage training with improved negative contrast for neural passage retrieval.	Jing Lu et al.	EMNLP 2021	NA
Learning robust dense retrieval models from incomplete relevance labels.	Prafull Prakash et al.	SIGIR 2021	Python
Efficient Training of Retrieval Models Using Negative Cache.	Erik M. Lindgren et al.	NeurIPS 2021	Python
CODER: An efficient framework for improving retrieval through COntextual Document Embedding Reranking.	George Zerveas et al.	Arxiv 2021	NA
Curriculum Learning for Dense Retrieval Distillation.	Hansi Zeng et al.	SIGIR 2022	Python
SimANS: Simple Ambiguous Negatives Sampling for Dense Text Retrieval.	Kun Zhou et al.	EMNLP 2022	Python

Data Augmentation

Paper	Author	Venue	Code
UniK-QA: Unified Representations of Structured and Unstructured Knowledge for Open-Domain Question Answering.	Barlas Oguz et al.	Arxiv 2021	NA
Augmented SBERT: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks.	Nandan Thakur et al.	NAACL 2021	Python
Is Retriever Merely an Approximator of Reader?	Sohee Yang et al.	Arxiv 2020	NA
Distilling Knowledge from Reader to Retriever for Question Answering.	Gautier Izacard et al.	ICLR 2021	Python
Distilling Knowledge for Fast Retrieval-based Chat-bots.	Amir Vakili Tahami et al.	SIGIR 2020	Python
Improving Efficient Neural Ranking Models with Cross-Architecture Knowledge Distillation.	Sebastian Hofstätter et al.	Arxiv 2020	Python
Distilling Dense Representations for Ranking using Tightly-Coupled Teachers.	Sheng-Chieh Lin et al.	Arxiv 2020	Python
In-Batch Negatives for Knowledge Distillation with Tightly-Coupled Teachers for Dense Retrieval.	Sheng-Chieh Lin et al.	RepL4NLP 2021	Python
Neural Retrieval for Question Answering with Cross-Attention Supervised Data Augmentation.	Yinfei Yang et al.	ACL 2021	NA
Enhancing Dual-Encoders with Question and Answer Cross-Embeddings for Answer Retrieval.	Yanmeng Wang et al.	EMNLP 2021	NA
Pseudo Label based Contrastive Sampling for Long Text Retrieval.	Le Zhu et al.	IALP 2021	NA
Multi-View Document Representation Learning for Open-Domain Dense Retrieval.	Shunyu Zhang et al.	ACL 2022	NA
Augmenting Document Representations for Dense Retrieval with Interpolation and Perturbation.	Soyeong Jeong et al.	ACL 2022	Python
ERNIE-Search: Bridging Cross-Encoder with Dual-Encoder via Self On-the-fly Distillation for Dense Passage Retrieval.	Yuxiang Lu et al.	Arxiv 2022	NA
Pro-KD: Progressive Distillation by Following the Footsteps of the Teacher.	Mehdi Rezagholizadeh et al.	COLING 2022	NA
Questions Are All You Need to Train a Dense Passage Retriever.	Devendra Singh Sachan et al.	Arxiv 2022	Python
PROD: Progressive Distillation for Dense Retrieval.	Zhenghao Lin et al.	Arxiv 2022	NA
Answering Open-Domain Questions of Varying Reasoning Steps from Text.	Peng Qi et al.	EMNLP 2021	Python
Multi-Task Retrieval for Knowledge-Intensive Tasks.	Jean Maillard et al.	ACL 2021	NA

Pre-training

Paper	Author	Venue	Code
Latent Retrieval for Weakly Supervised Open Domain Question Answering.	Kenton Lee et al.	ACL 2019	Python
Pre-training tasks for embedding-based large scale retrieval.	Wei-Cheng Chang et al.	ICLR 2020	NA
PROP: Pre-training with Representative Words Prediction for Ad-hoc Retrieval.	Xinyu Ma et al.	WSDM 2021	Python
B-PROP: Bootstrapped Pre-training with Representative Words Prediction for Ad-hoc Retrieval.	Xinyu Ma et al.	SIGIR 2021	NA
Domain-matched Pre-training Tasks for Dense Retrieval.	Barlas Oguz et al.	Arxiv 2021	NA
Less is More: Pre-train a Strong Text Encoder for Dense Retrieval Using a Weak Decoder.	Shuqi Lu et al.	EMNLP 2021	Python
Sentence-T5: Scalable Sentence Encoders from Pre-trained Text-to-Text Models.	Jianmo Ni et al.	Arxiv 2021	Python
Unsupervised Corpus Aware Language Model Pre-training for Dense Passage Retrieval.	Luyu Gao et al.	ACL 2022	Python
Condenser: a Pre-training Architecture for Dense Retrieval.	Luyu Gao et al.	EMNLP 2021	Python
TSDAE: Using Transformer-based Sequential Denoising Auto-Encoder for Unsupervised Sentence Embedding Learning.	Kexin Wang et al.	EMNLP 2021	Python
SimCSE: Simple Contrastive Learning of Sentence Embeddings.	Tianyu Gao et al.	EMNLP 2021	Python
Towards Robust Neural Retrieval Models with Synthetic Pre-Training.	Revanth Gangi Reddy et al.	Arxiv 2021	NA
Hyperlink-induced Pre-training for Passage Retrieval in Open-domain Question Answering.	Jiawei Zhou et al.	ACL 2022	Python
Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks.	Nils Reimers et al.	EMNLP 2019	Python
Introducing Neural Bag of Whole-Words with ColBERTer: Contextualized Late Interactions using Enhanced Reduction.	Sebastian Hofstätter et al.	Arxiv 2022	Python
Learning to Retrieve Passages without Supervision.	Ori Ram et al.	Arxiv 2021	Python
Text and Code Embeddings by Contrastive Pre-Training.	Arvind Neelakantan et al.	Arxiv 2022	NA
Pre-train a Discriminative Text Encoder for Dense Retrieval via Contrastive Span Prediction.	Xinyu Ma et al.	SIGIR 2022	Python
RetroMAE: Pre-Training Retrieval-oriented Language Models Via Masked Auto-Encoder.	Shitao Xiao et al.	CoRR 2022
SIMLM: Pre-training with Representation Bottleneck for Dense Passage Retrieval.	Liang Wang et al.	CoRR 2022	Python
Masked Autoencoders As The Unified Learners For Pre-Trained Sentence Representation.	Alexander Liu et al.	CoRR 2022	NA
LEXMAE: Lexicon-BottleNecked Pretraining for Large-scale Retrieval.	Tao Shen et al.	Arxiv 2022	NA
PAQ: 65 Million Probably-Asked Questions and What You Can Do With Them.	Patrick Lewis et al.	Arxiv 2021	Python
End-to-End Synthetic Data Generation for Domain Adaptation of Question Answering Systems.	Siamak Shakeri et al.	EMNLP 2020	NA
ConTextual Mask Auto-Encoder for Dense Passage Retrieval.	Xing Wu et al.	Arxiv 2022	Python
A Contrastive Pre-training Approach to Learn Discriminative Autoencoder for Dense Retrieval.	Xinyu Ma et al.	Arxiv 2022	NA

Indexing

Paper	Author	Venue	Code
Learning Passage Impacts for Inverted Indexes.	Antonio Mallia et al.	SIGIR 2021	Python
Accelerating Large-Scale Inference with Anisotropic Vector Quantization.	Ruiqi Guo et al.	Arxiv 2019	Python
Jointly Optimizing Query Encoder and Product Quantization to Improve Retrieval Performance.	Jingtao Zhan et al.	CIKM 2021	Python
Learning Discrete Representations via Constrained Clustering for Effective and Efficient Dense Retrieval.	Jingtao Zhan et al.	WSDM 2022	Python
Joint Learning of Deep Retrieval Model and Product Quantization based Embedding Index.	Han Zhang et al.	SIGIR 2021	Python
Efficient Passage Retrieval with Hashing for Open-domain Question Answering.	Ikuya Yamada et al.	ACL 2021	Python
A Memory Efficient Baseline for Open Domain Question Answering.	Gautier Izacard et al.	Arxiv 2020	NA
Simple and Effective Unsupervised Redundancy Elimination to Compress Dense Vectors for Passage Retrieval.	Xueguang Ma et al.	EMNLP 2021	Python
The Curse of Dense Low-Dimensional Information Retrieval for Large Index Sizes.	Nils Reimers et al.	ACL 2021	NA
Matching-oriented Product Quantization For Ad-hoc Retrieval.	Shitao Xiao et al.	EMNLP 2021	Python
Progressively Optimized Bi-Granular Document Representation for Scalable Embedding Based Retrieval.	Shitao Xiao et al.	WWW 2022	NA
Asymmetric LSH (ALSH) for Sublinear Time Maximum Inner Product Search (MIPS).	Anshumali Shrivastava et al.	NeurIPS 2014	NA
ANN-Benchmarks: A Benchmarking Tool for Approximate Nearest Neighbor Algorithms.	Martin Aumüller et al.	SISAP 2017	NA
Results of the NeurIPS’21 Challenge on Billion-Scale Approximate Nearest Neighbor Search.	Harsha Vardhan Simhadri et al.	Arxiv 2022	Python
Interpreting Dense Retrieval as Mixture of Topics.	Jingtao Zhan et al.	Arxiv 2021	NA
The Web Is Your Oyster - Knowledge-Intensive NLP against a Very Large Web Corpus.	Aleksandra Piktus et al.	CoRR 2021	NA
Bi-Phase Enhanced IVFPQ for Time-Efficient Ad-hoc Retrieval.	Peitian Zhang et al.	Arxiv 2022	NA

Integration with Re-ranking

Paper	Author	Venue	Code
RocketQAv2: A Joint Training Method for Dense Passage Retrieval and Passage Re-ranking.	Ruiyang Ren et al.	EMNLP 2021	Python
Dealing with Typos for BERT-based Passage Retrieval and Ranking.	Shengyao Zhuang et al.	EMNLP 2021	Python
Trans-Encoder: Unsupervised sentence-pair modelling through self- and mutual-distillations.	Fangyu Liu et al.	ICLR 2022	Python
Adversarial Retriever-Ranker for dense text retrieval.	Hang Zhang et al.	Arxiv 2021	NA
Embedding-based Retrieval in Facebook Search.	Jui-Ting Huang et al.	KDD 2020	NA
Passage Re-ranking With BERT.	Rodrigo Nogueira et al.	Arxiv 2019	NA
Understanding the Behaviors of BERT in Ranking.	Yifan Qiao et al.	CoRR 2019	NA
Multi-passage BERT: A Globally Normalized BERT Model for Open-domain Question Answering.	Zhiguo Wang et al.	Arxiv 2019	NA
Towards Robust Ranker for Text Retrieval.	Yucheng Zhou et al.	Arxiv 2022	Python
Rethink Training of BERT Rerankers in Multi-Stage Retrieval Pipeline.	Luyu Gao et al.	ECIR 2021	NA
Multi-Stage Document Ranking with BERT.	Rodrigo Nogueira et al.	CoRR 2019	NA

Advanced Topics

Zero-shot Dense Retrieval

Paper	Author	Venue	Code
BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models.	Nandan Thakur et al.	NeurIPS 2021	Python
A Thorough Examination on Zero-shot Dense Retrieval.	Ruiyang Ren et al.	Arxiv 2022	NA
Challenges in Generalization in Open Domain Question Answering.	Linqing Liu et al.	NAACL 2022	Python
Zero-shot Neural Passage Retrieval via Domain-targeted Synthetic Question Generation.	Ji Ma et al.	Arxiv 2021	NA
Efficient Retrieval Optimized Multi-task Learning.	Hengxin Fun et al.	Arxiv 2021	NA
Zero-Shot Dense Retrieval with Momentum Adversarial Domain Invariant Representations.	Ji Xin et al.	ACL 2022	NA
Towards Robust Neural Retrieval Models with Synthetic Pre-Training.	Revanth Gangi Reddy et al.	Arxiv 2021	NA
Embedding-based Zero-shot Retrieval through Query Generation.	Davis Liang et al.	Arxiv 2020	NA
GPL: Generative Pseudo Labeling for Unsupervised Domain Adaptation of Dense Retrieval.	Kexin Wang et al.	Arxiv 2021	Python
Salient Phrase Aware Dense Retrieval: Can a Dense Retriever Imitate a Sparse One?	Xilun Chen et al.	Arxiv 2021	Python
LaPraDoR: Unsupervised Pretrained Dense Retriever for Zero-Shot Text Retrieval.	Canwen Xu et al.	ACL 2022	Python
Out-of-Domain Semantics to the Rescue! Zero-Shot Hybrid Retrieval Models.	Tao Chen et al.	ECIR 2022	NA
Towards Unsupervised Dense Information Retrieval with Contrastive Learning.	Gautier Izacard et al.	Arxiv 2021	NA
Large Dual Encoders Are Generalizable Retrievers.	Jianmo Ni et al.	Arxiv 2021	NA
KILT: a Benchmark for Knowledge Intensive Language Tasks.	Fabio Petroni et al.	Arxiv 2020	Python
Promptagator: Few-shot Dense Retrieval From 8 Examples.	Zhuyun Dai et al.	Arxiv 2022	NA

Improving the Robustness to Query Variations

Paper	Author	Venue	Code
Towards Robust Dense Retrieval via Local Ranking Alignment.	Xuanang Chen et al.	IJCAI 2022	Python
CharacterBERT and Self-Teaching for Improving the Robustness of Dense Retrievers on Queries with Typos.	Shengyao Zhuang et al.	SIGIR 2022	Python
Evaluating the Robustness of Retrieval Pipelines with Query Variation Generators.	Gustavo Penha et al.	ECIR 2022	Python
Retrieval Consistency in the Presence of Query Variations.	Peter Bailey et al.	SIGIR 2017	NA
Analysing the Robustness of Dual Encoders for Dense Retrieval Against Misspellings.	Georgios Sidiropoulos et al.	Arxiv 2022	Shell
A Survey of Automatic Query Expansion in Information Retrieval.	Claudio Carpineto et al.	CSUR 2012	NA
BERT Rankers are Brittle: a Study using Adversarial Document Perturbations.	Yumeng Wang et al.	SIGIR 2022	Python
Order-Disorder: Imitation Adversarial Attacks for Black-box Neural Ranking Models.	Jiawei Liu et al.	CoRR 2022	NA

Generative Text Retrieval

Paper	Author	Venue	Code
Transformer Memory as a Differentiable Search Index.	Yi Tay et al.	Arxiv 2022	NA
DynamicRetriever: A Pre-training Model-based IR System with Neither Sparse nor Dense Index.	Yujia Zhou et al.	Arxiv 2022	NA
Autoregressive Search Engines: Generating Substrings as Document Identifiers.	Michele Bevilacqua et al.	Arxiv 2022	Python
Generative Retrieval for Long Sequences.	Hyunji Lee et al.	Arxiv 2022	NA
GERE: Generative Evidence Retrieval for Fact Verification.	Jiangui Chen et al.	SIGIR 2022	Python
Autoregressive Entity Retrieval.	Nicola De Cao et al.	ICLR 2021	Python
Rethinking Search: Making Domain Experts out of Dilettantes.	Donald Metzler et al.	SIGIR 2021	NA
A Neural Corpus Indexer for Document Retrieval.	Yujing Wang et al.	CoRR 2022	NA
Bridging the Gap Between Indexing and Retrieval for Differentiable Search Index with Query Generation.	Shengyao Zhuang et al.	CoRR 2022	Python
Ultron: An Ultimate Retriever on Corpus with a Model-based Indexer.	Yujia Zhou et al.	CoRR 2022	NA
CorpusBrain: Pre-train a Generative Retrieval Model for Knowledge-Intensive Language Tasks.	Jiangui Chen et al.	Arxiv 2022	Python

Retrieval-Augmented Language Model

Paper	Author	Venue	Code
Generalization through memorization: Nearest neighbor language models.	Urvashi Khandelwal et al.	Arxiv 2020	Python
Adaptive semiparametric language models.	Dani Yogatama et al.	TACL 2021	NA
Improving language models by retrieving from trillions of tokens.	Sebastian Borgeaud et al.	Arxiv 2021	NA
REALM: Retrieval-Augmented Language Model Pre-Training.	Kelvin Guu et al.	ICML 2020	Python
Simple and Efficient ways to Improve REALM.	Vidhisha Balachandran et al.	Arxiv 2021	NA
Efficient Nearest Neighbor Language Models.	Junxian He et al.	EMNLP 2021	Python

Applications

Information Retrieval Applications

Paper	Author	Venue	Code
Multi-modal Retrieval of Tables and Texts Using Tri-encoder Models.	Bogdan Kostic et al.	Arxiv 2021	NA
Open Domain Question Answering over Tables via Dense Retrieval.	Jonathan Herzig et al.	NAACL 2021	Python
SituatedQA: Incorporating Extra-Linguistic Contexts into QA.	Michael J.Q. Zhang et al.	EMNLP 2021	DATA
XOR QA: Cross-lingual Open-Retrieval Question Answering.	Akari Asai et al.	NAACL 2021	Python
One Question Answering Model for Many Languages with Cross-lingual Dense Passage Retrieval.	Akari Asai et al.	NeurIPS 2021	Python
Evaluating Token-Level and Passage-Level Dense Retrieval Models for Math Information Retrieval.	Wei Zhong et al.	Arxiv 2022	Python
ReACC: A Retrieval-Augmented Code Completion Framework.	Shuai Lu et al.	ACL 2022	Python
Improving Biomedical Information Retrieval with Neural Retrievers.	Man Luo et al.	AAAI 2022	NA

Natural Language Processing Applications

Paper	Author	Venue	Code
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.	Patrick Lewis et al.	Arxiv 2020	NA
Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering.	Gautier Izacard et al.	EACL 2021	Python
End-to-End Training of Neural Retrievers for Open-Domain Question Answering.	Devendra Singh Sachan et al.	ACL 2021	Python
Baleen: Robust Multi-Hop Reasoning at Scale via Condensed Retrieval.	Omar Khattab et al.	NeurIPS 2021	Python
Answering Complex Open-domain Questions with Multi-hop Dense Retrieval.	Wenhan Xiong et al.	ICLR 2021	Python
Learning Dense Representations for Entity Retrieval.	Daniel Gillick et al.	CoNLL 2019	NA
Scalable Zero-shot Entity Linking with Dense Entity Retrieval.	Ledell Wu et al.	EMNLP 2020	Python
Zero-Shot Entity Linking by Reading Entity Descriptions.	Lajanugen Logeswaran et al.	ACL 2019	Python
Retrieval Augmentation Reduces Hallucination in Conversation.	Kurt Shuster et al.	EMNLP 2021	NA
Internet-Augmented Dialogue Generation.	Mojtaba Komeili et al.	ACL 2022	NA
LaMDA: Language Models for Dialog Applications.	Romal Thoppilan et al.	Arxiv 2022	NA

Industrial Practice

Paper	Author	Venue	Code
Pre-trained Language Model for Web-scale Retrieval in Baidu Search.	Yiding Liu et al.	KDD 2021	NA
MOBIUS: Towards the Next Generation of Query-Ad Matching in Baidu’s Sponsored Search.	Miao Fan et al.	KDD 2019	NA
Uni-Retriever: Towards Learning The Unified Embedding Based Retriever in Bing Sponsored Search.	Jianjin Zhang et al.	Arxiv 2022	NA
Embedding-based Product Retrieval in Taobao Search.	Sen Li et al.	KDD 2021	NA
Que2Search: Fast and Accurate Query and Document Understanding for Search at Facebook.	Yiqun Liu et al.	KDD 2021	NA
DiskANN: Fast Accurate Billion-point Nearest Neighbor Search on a Single Node.	Suhas Jayaram Subramanya et al.	NeurIPS 2019	Python
SPANN: Highly-efficient Billion-scale Approximate Nearest Neighbor Search.	Qi Chen et al.	NeurIPS 2021	Python
HEARTS: Multi-task Fusion of Dense Retrieval and Non-autoregressive Generation for Sponsored Search.	Bhargav Dodla et al.	Arxiv 2022	NA
Sponsored Search Auctions: Recent Advances and Future Directions.	Tao Qin et al.	TIST 2015	NA
Semantic Retrieval at Walmart.	Alessandro Magnani et al.	KDD 2022	NA

Datasets

Paper	Author	Venue	Link
BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models.	Nandan Thakur et al.	NeurIPS 2021	DATA
MS MARCO: A Human Generated MAchine Reading COmprehension Dataset.	Payal Bajaj et al.	NeurIPS 2016	DATA
Natural Questions: a Benchmark for Question Answering Research.	Tom Kwiatkowski et al.	TACL 2019	DATA
TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension.	Mandar Joshi et al.	ACL 2017	DATA
mMARCO: A Multilingual Version of the MS MARCO Passage Ranking Dataset.	Luiz Henrique Bonifacio et al.	Arxiv 2021	DATA
TREC 2019 News Track Overview.	Ian Soboroff et al.	TREC 2019	DATA
TREC-COVID: rationale and structure of an information retrieval shared task for COVID-19.	Kirk Roberts et al.	J Am Med Inform Assoc. 2020	DATA
A Full-Text Learning to Rank Dataset for Medical Information Retrieval.	Vera Boteva et al.	ECIR 2016	DATA
A Data Collection for Evaluating the Retrieval of Related Tweets to News Articles.	Axel Suarez et al.	ECIR 2018	DATA
Overview of Touché 2020: Argument Retrieval.	Alexander Bondarenko et al.	CLEF 2020	DATA
Retrieval of the Best Counterargument without Prior Topic Knowledge.	Henning Wachsmuth et al.	ACL 2018	DATA
DBpedia-Entity v2: A Test Collection for Entity Search.	Faegheh Hasibi et al.	SIGIR 2017	DATA
ORCAS: 20 Million Clicked Query-Document Pairs for Analyzing Search.	Nick Craswell et al.	CIKM 2020	DATA
TREC 2022 Deep Learning Track Guidelines.	Nick Craswell et al.	TREC 2021	DATA
DuReader_retrieval: A Large-scale Chinese Benchmark for Passage Retrieval from Web Search Engine.	Yifu Qiu et al.	Arxiv 2022	DATA
SQuAD: 100,000+ Questions for Machine Comprehension of Text.	Pranav Rajpurkar et al.	EMNLP 2016	DATA
HOTPOTQA: A Dataset for Diverse, Explainable Multi-hop Question Answering.	Zhilin Yang et al.	EMNLP 2018	DATA
Semantic Parsing on Freebase from Question-Answer Pairs.	Jonathan Berant et al.	EMNLP 2013	DATA
Modeling of the Question Answering Task in the YodaQA System.	Petr Baudiš et al.	CLEF 2015	DATA
WWW'18 Open Challenge: Financial Opinion Mining and Question Answering.	Macedo Maia et al.	WWW 2018	DATA
An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition.	George Tsatsaronis et al.	BMC Bioinform. 2015	DATA
CQADupStack: A Benchmark Data Set for Community Question-Answering Research.	Doris Hoogeveen et al.	ADCS 2015	DATA
First Quora Dataset Release: Question Pairs.	Shankar Iyer et al.	Webpage	DATA
CCQA: A New Web-Scale Question Answering Dataset for Model Pre-Training.	Patrick Huber et al.	NAACL 2022	DATA
FEVER: a Large-scale Dataset for Fact Extraction and VERification.	James Thorne et al.	NAACL 2018	DATA
CLIMATE-FEVER: A Dataset for Verification of Real-World Climate Claims.	Thomas Diggelmann et al.	NeurIPS 2020	DATA
Fact or Fiction: Verifying Scientific Claims.	David Wadden et al.	EMNLP 2020	DATA
SPECTER: Document-level Representation Learning using Citation-informed Transformers.	Arman Cohan et al.	ACL 2020	DATA
Simple Entity-Centric Questions Challenge Dense Retrievers.	Christopher Sciavolino et al.	EMNLP 2021	DATA
ArchivalQA: A Large-scale Benchmark Dataset for Open Domain Question Answering over Archival News Collections.	Jiexin Wang et al.	Arxiv 2021	NA
Multi-CPR: A Multi Domain Chinese Dataset for Passage Retrieval.	Dingkun Long et al.	SIGIR 2022	DATA
HOVER: A Dataset for Many-Hop Fact Extraction And Claim Verification.	Yichen Jiang et al.	EMNLP 2020	DATA
TREC 2021 Deep Learning Track Guidelines.	Nick Craswell et al.	NA	NA
MSMarco Chameleons: Challenging the MSMarco Leaderboard with Extremely Obstinate Queries.	Negar Arabzadeh et al.	CIKM 2021	Roff

Libraries

Paper	Author	Venue	Code
RocketQA	---	webpage	Python
Billion-scale similarity search with GPUs.	Jeff Johnson et al.	TBD 2019	Python
Pyserini: An Easy-to-Use Python Toolkit to Support Replicable IR Research with Sparse and Dense Representations.	Jimmy Lin et al.	Arxiv 2021	Python
MatchZoo: A Learning, Practicing, and Developing System for Neural Text Matching.	Jiafeng Guo et al.	SIGIR 2019	Python
Anserini: Enabling the Use of Lucene for Information Retrieval Research.	Peilin Yang et al.	SIGIR 2017	Java
Tevatron: An Efficient and Flexible Toolkit for Dense Retrieval.	Luyu Gao et al.	Arxiv 2022	Python
Asyncval: A Toolkit for Asynchronously Validating Dense Retriever Checkpoints during Training.	Shengyao Zhuang et al.	SIGIR 2022	Python
Pyserini: A Python Toolkit for Reproducible Information Retrieval Research with Sparse and Dense Representations.	Jimmy Lin et al.	SIGIR 2021	NA
OpenMatch: An Open Source Library for Neu-IR Research.	Zhenghao Liu et al.	SIGIR 2021	Python
SentEval: An Evaluation Toolkit for Universal Sentence Representations.	Alexis Conneau et al.	Arxiv 2018	Python
