Dense Retrieval Papers

A collection of papers related to dense retrieval.

The papers are organized following the structure of our survey "Dense Text Retrieval based on Pretrained Language Models: A Survey".

If you find our survey useful for your research, please cite the following paper:

@article{DRSurvey,
    title={Dense Text Retrieval based on Pretrained Language Models: A Survey},
    author={Wayne Xin Zhao and Jing Liu and Ruiyang Ren and Ji-Rong Wen},
    year={2022},
    journal={arXiv preprint arXiv:2211.14876}
}

Survey Paper

Paper Author Venue Code
Pretrained Transformers for Text Ranking: BERT and Beyond. Jimmy Lin et al. Synthesis HLT 2021 NA
Semantic Models for the First-stage Retrieval: A Comprehensive Review. Yinqiong Cai et al. Arxiv 2021 NA
Pre-training Methods in Information Retrieval. Yixing Fan et al. Arxiv 2021 NA
A Deep Look into Neural Ranking Models for Information Retrieval. Jiafeng Guo et al. Inf. Process. Manag. 2020 NA
Lecture Notes on Neural Information Retrieval. Nicola Tonellotto. Arxiv 2022 NA
Low-Resource Dense Retrieval for Open-Domain Question Answering: A Comprehensive Survey. Xiaoyu Shen et al. Arxiv 2022 NA

Architecture

Paper Author Venue Code
Poly-encoders: Architectures and pre-training strategies for fast and accurate multi-sentence scoring. Samuel Humeau et al. ICLR 2020 Python
Sparse, Dense, and Attentional Representations for Text Retrieval. Yi Luan et al. TACL 2021 Python
ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. Omar Khattab et al. SIGIR 2020 Python
Query Embedding Pruning for Dense Retrieval. Nicola Tonellotto et al. CIKM 2021 Python
Context-Aware Term Weighting For First Stage Passage Retrieval. Zhuyun Dai et al. SIGIR 2020 Python
Context-Aware Document Term Weighting for Ad-Hoc Search. Zhuyun Dai et al. WWW 2020 Python
DC-BERT: Decoupling Question and Document for Efficient Contextual Encoding. Yuyu Zhang et al. SIGIR 2020 NA
Real-Time Open-Domain Question Answering with Dense-Sparse Phrase Index. Minjoon Seo et al. ACL 2019 Python
Learning Dense Representations of Phrases at Scale. Jinhyuk Lee et al. ACL 2021 Python
Phrase Retrieval Learns Passage Retrieval, Too. Jinhyuk Lee et al. EMNLP 2021 Python
Dense Hierarchical Retrieval for Open-Domain Question Answering. Ye Liu et al. EMNLP 2021 Python
The Curse of Dense Low-Dimensional Information Retrieval for Large Index Sizes. Nils Reimers et al. ACL 2021 NA
Predicting Efficiency/Effectiveness Trade-offs for Dense vs. Sparse Retrieval Strategy Selection. Negar Arabzadeh et al. CIKM 2021 Python
Boosted Dense Retriever. Patrick Lewis et al. Arxiv 2021 NA
PARM: A Paragraph Aggregation Retrieval Model for Dense Document-to-Document Retrieval. Sophia Althammer et al. ECIR 2022 Python
Sparsifying Sparse Representations for Passage Retrieval by Top-k Masking. Jheng-Hong Yang et al. Arxiv 2021 NA
Improving Document Representations by Generating Pseudo Query Embeddings for Dense Retrieval. Hongyin Tang et al. ACL 2021 NA
ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction. Keshav Santhanam et al. Arxiv 2021 Python
GNN-encoder: Learning a Dual-encoder Architecture via Graph Neural Networks for Dense Passage Retrieval. Jiduan Liu et al. Arxiv 2022 NA
Sentence-aware Contrastive Learning for Open-Domain Passage Retrieval. Bohong Wu et al. ACL 2022 Python
Aggretriever: A Simple Approach to Aggregate Textual Representation for Robust Dense Passage Retrieval. Sheng-Chieh Lin et al. Arxiv 2022 Python
DPTDR: Deep Prompt Tuning for Dense Passage Retrieval. Zhengyang Tang et al. Arxiv 2022 Python
LED: Lexicon-Enlightened Dense Retriever for Large-Scale Retrieval. Kai Zhang et al. Arxiv 2022 NA
Task-Aware Specialization for Efficient and Robust Dense Retrieval for Open-Domain Question Answering. Hao Cheng et al. Arxiv 2022 NA
COIL: Revisit Exact Lexical Match in Information Retrieval with Contextualized Inverted List. Luyu Gao et al. NAACL 2021 Python
A Few Brief Notes on DeepImpact, COIL, and a Conceptual Framework for Information Retrieval Techniques. Jimmy Lin et al. Arxiv 2021 NA
Pseudo Relevance Feedback with Deep Language Models and Dense Retrievers: Successes and Pitfalls. Hang Li et al. Arxiv 2021 NA
Improving Query Representations for Dense Retrieval with Pseudo Relevance Feedback. HongChien Yu et al. CIKM 2021 Python
Pseudo-Relevance Feedback for Multiple Representation Dense Retrieval. Xiao Wang et al. SIGIR 2021 NA
Improving Query Representations for Dense Retrieval with Pseudo Relevance Feedback: A Reproducibility Study. Hang Li et al. Arxiv 2021 NA
Implicit Feedback for Dense Passage Retrieval: A Counterfactual Approach. Shengyao Zhuang et al. Arxiv 2022 Python
Parameter-Efficient Prompt Tuning Makes Generalized and Calibrated Neural Text Retrievers. Weng Lam Tam et al. Arxiv 2022 Python
Densifying Sparse Representations for Passage Retrieval by Representational Slicing. Sheng-Chieh Lin et al. Arxiv 2021 NA
SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking. Thibault Formal et al. SIGIR 2021 Python
SPLADE v2: Sparse Lexical and Expansion Model for Information Retrieval. Thibault Formal et al. Arxiv 2021 Python
BERT-based Dense Retrievers Require Interpolation with BM25 for Effective Passage Retrieval. Shuai Wang et al. ICTIR 2021 Python
A White Box Analysis of ColBERT. Thibault Formal et al. ECIR 2021 NA
Towards Axiomatic Explanations for Neural Ranking Models. Michael Völske et al. ICTIR 2021 Python
ABNIRML: Analyzing the Behavior of Neural IR Models. Sean MacAvaney et al. Arxiv 2020 Python

Training

Formulation

Paper Author Venue Code
More Robust Dense Retrieval with Contrastive Dual Learning. Yizhi Li et al. ICTIR 2021 Python
PAIR: Leveraging Passage-Centric Similarity Relation for Improving Dense Passage Retrieval. Ruiyang Ren et al. ACL 2021 Python
xMoCo: Cross Momentum Contrastive Learning for Open-Domain Question Answering. Nan Yang et al. ACL 2021 NA
A Modern Perspective on Query Likelihood with Deep Generative Retrieval Models. Oleg Lesota et al. ICTIR 2021 Python
Learning Diverse Document Representations with Deep Query Interactions for Dense Retrieval. Zehan Li et al. Arxiv 2022 Python
Shallow pooling for sparse labels. Negar Arabzadeh et al. Arxiv 2021 NA
Hard Negatives or False Negatives: Correcting Pooling Bias in Training Neural Ranking Models. Yinqiong Cai et al. Arxiv 2022 NA
Debiased Contrastive Learning of Unsupervised Sentence Representations. Kun Zhou et al. ACL 2022 NA

Negative Selection

Paper Author Venue Code
Learning To Retrieve: How to Train a Dense Retrieval Model Effectively and Efficiently. Jingtao Zhan et al. Arxiv 2020 NA
Dense Passage Retrieval for Open-Domain Question Answering. Vladimir Karpukhin et al. EMNLP 2020 Python
RepBERT: Contextualized Text Embeddings for First-Stage Retrieval. Jingtao Zhan et al. Arxiv 2020 Python
Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval. Lee Xiong et al. ICLR 2021 Python
Optimizing Dense Retrieval Model Training with Hard Negatives. Jingtao Zhan et al. SIGIR 2021 Python
Neural Passage Retrieval with Improved Negative Contrast. Jing Lu et al. Arxiv 2020 NA
RocketQA: An Optimized Training Approach to Dense Passage Retrieval for Open-Domain Question Answering. Yingqi Qu et al. NAACL 2021 Python
Efficiently Teaching an Effective Dense Retriever with Balanced Topic Aware Sampling. Sebastian Hofstätter et al. SIGIR 2021 Python
Scaling deep contrastive learning batch size under memory limited setup. Luyu Gao et al. RepL4NLP 2021 Python
Multi-stage training with improved negative contrast for neural passage retrieval. Jing Lu et al. EMNLP 2021 NA
Learning robust dense retrieval models from incomplete relevance labels. Prafull Prakash et al. SIGIR 2021 Python
Efficient Training of Retrieval Models Using Negative Cache. Erik M. Lindgren et al. NeurIPS 2021 Python
CODER: An efficient framework for improving retrieval through COntextual Document Embedding Reranking. George Zerveas et al. Arxiv 2021 NA
Curriculum Learning for Dense Retrieval Distillation. Hansi Zeng et al. SIGIR 2022 Python
SimANS: Simple Ambiguous Negatives Sampling for Dense Text Retrieval. Kun Zhou et al. EMNLP 2022 Python

Data Augmentation

Paper Author Venue Code
UniK-QA: Unified Representations of Structured and Unstructured Knowledge for Open-Domain Question Answering. Barlas Oguz et al. Arxiv 2021 NA
Augmented SBERT: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks. Nandan Thakur et al. NAACL 2021 Python
Is Retriever Merely an Approximator of Reader? Sohee Yang et al. Arxiv 2020 NA
Distilling Knowledge from Reader to Retriever for Question Answering. Gautier Izacard et al. ICLR 2021 Python
Distilling Knowledge for Fast Retrieval-based Chat-bots. Amir Vakili Tahami et al. SIGIR 2020 Python
Improving Efficient Neural Ranking Models with Cross-Architecture Knowledge Distillation. Sebastian Hofstätter et al. Arxiv 2020 Python
Distilling Dense Representations for Ranking using Tightly-Coupled Teachers. Sheng-Chieh Lin et al. Arxiv 2020 Python
In-Batch Negatives for Knowledge Distillation with Tightly-Coupled Teachers for Dense Retrieval. Sheng-Chieh Lin et al. RepL4NLP 2021 Python
Neural Retrieval for Question Answering with Cross-Attention Supervised Data Augmentation. Yinfei Yang et al. ACL 2021 NA
Enhancing Dual-Encoders with Question and Answer Cross-Embeddings for Answer Retrieval. Yanmeng Wang et al. EMNLP 2021 NA
Pseudo Label based Contrastive Sampling for Long Text Retrieval. Le Zhu et al. IALP 2021 NA
Multi-View Document Representation Learning for Open-Domain Dense Retrieval. Shunyu Zhang et al. ACL 2022 NA
Augmenting Document Representations for Dense Retrieval with Interpolation and Perturbation. Soyeong Jeong et al. ACL 2022 Python
ERNIE-Search: Bridging Cross-Encoder with Dual-Encoder via Self On-the-fly Distillation for Dense Passage Retrieval. Yuxiang Lu et al. Arxiv 2022 NA
Pro-KD: Progressive Distillation by Following the Footsteps of the Teacher. Mehdi Rezagholizadeh et al. COLING 2022 NA
Questions Are All You Need to Train a Dense Passage Retriever. Devendra Singh Sachan et al. Arxiv 2022 Python
PROD: Progressive Distillation for Dense Retrieval. Zhenghao Lin et al. Arxiv 2022 NA
Answering Open-Domain Questions of Varying Reasoning Steps from Text. Peng Qi et al. EMNLP 2021 Python
Multi-Task Retrieval for Knowledge-Intensive Tasks. Jean Maillard et al. ACL 2021 NA

Pre-training

Paper Author Venue Code
Latent Retrieval for Weakly Supervised Open Domain Question Answering. Kenton Lee et al. ACL 2019 Python
Pre-training tasks for embedding-based large scale retrieval. Wei-Cheng Chang et al. ICLR 2020 NA
PROP: Pre-training with Representative Words Prediction for Ad-hoc Retrieval. Xinyu Ma et al. WSDM 2021 Python
B-PROP: Bootstrapped Pre-training with Representative Words Prediction for Ad-hoc Retrieval. Xinyu Ma et al. SIGIR 2021 NA
Domain-matched Pre-training Tasks for Dense Retrieval. Barlas Oguz et al. Arxiv 2021 NA
Less is More: Pre-train a Strong Text Encoder for Dense Retrieval Using a Weak Decoder. Shuqi Lu et al. EMNLP 2021 Python
Sentence-T5: Scalable Sentence Encoders from Pre-trained Text-to-Text Models. Jianmo Ni et al. Arxiv 2021 Python
Unsupervised Corpus Aware Language Model Pre-training for Dense Passage Retrieval. Luyu Gao et al. ACL 2022 Python
Condenser: a Pre-training Architecture for Dense Retrieval. Luyu Gao et al. EMNLP 2021 Python
TSDAE: Using Transformer-based Sequential Denoising Auto-Encoder for Unsupervised Sentence Embedding Learning. Kexin Wang et al. EMNLP 2021 Python
SimCSE: Simple Contrastive Learning of Sentence Embeddings. Tianyu Gao et al. EMNLP 2021 Python
Towards Robust Neural Retrieval Models with Synthetic Pre-Training. Revanth Gangi Reddy et al. Arxiv 2021 NA
Hyperlink-induced Pre-training for Passage Retrieval in Open-domain Question Answering. Jiawei Zhou et al. ACL 2022 Python
Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. Nils Reimers et al. EMNLP 2019 Python
Introducing Neural Bag of Whole-Words with ColBERTer: Contextualized Late Interactions using Enhanced Reduction. Sebastian Hofstätter et al. Arxiv 2022 Python
Learning to Retrieve Passages without Supervision. Ori Ram et al. Arxiv 2021 Python
Text and Code Embeddings by Contrastive Pre-Training. Arvind Neelakantan et al. Arxiv 2022 NA
Pre-train a Discriminative Text Encoder for Dense Retrieval via Contrastive Span Prediction. Xinyu Ma et al. SIGIR 2022 Python
RetroMAE: Pre-Training Retrieval-oriented Language Models Via Masked Auto-Encoder. Shitao Xiao et al. CoRR 2022
SIMLM: Pre-training with Representation Bottleneck for Dense Passage Retrieval. Liang Wang et al. CoRR 2022 Python
Masked Autoencoders As The Unified Learners For Pre-Trained Sentence Representation. Alexander Liu et al. CoRR 2022 NA
LEXMAE: Lexicon-BottleNecked Pretraining for Large-scale Retrieval. Tao Shen et al. Arxiv 2022 NA
PAQ: 65 Million Probably-Asked Questions and What You Can Do With Them. Patrick Lewis et al. Arxiv 2021 Python
End-to-End Synthetic Data Generation for Domain Adaptation of Question Answering Systems. Siamak Shakeri et al. EMNLP 2020 NA
ConTextual Mask Auto-Encoder for Dense Passage Retrieval. Xing Wu et al. Arxiv 2022 Python
A Contrastive Pre-training Approach to Learn Discriminative Autoencoder for Dense Retrieval. Xinyu Ma et al. Arxiv 2022 NA

Indexing

Paper Author Venue Code
Learning Passage Impacts for Inverted Indexes. Antonio Mallia et al. SIGIR 2021 Python
Accelerating Large-Scale Inference with Anisotropic Vector Quantization. Ruiqi Guo et al. Arxiv 2019 Python
Jointly Optimizing Query Encoder and Product Quantization to Improve Retrieval Performance. Jingtao Zhan et al. CIKM 2021 Python
Learning Discrete Representations via Constrained Clustering for Effective and Efficient Dense Retrieval. Jingtao Zhan et al. WSDM 2022 Python
Joint Learning of Deep Retrieval Model and Product Quantization based Embedding Index. Han Zhang et al. SIGIR 2021 Python
Efficient Passage Retrieval with Hashing for Open-domain Question Answering. Ikuya Yamada et al. ACL 2021 Python
A Memory Efficient Baseline for Open Domain Question Answering. Gautier Izacard et al. Arxiv 2020 NA
Simple and Effective Unsupervised Redundancy Elimination to Compress Dense Vectors for Passage Retrieval. Xueguang Ma et al. EMNLP 2021 Python
The Curse of Dense Low-Dimensional Information Retrieval for Large Index Sizes. Nils Reimers et al. ACL 2021 NA
Matching-oriented Product Quantization For Ad-hoc Retrieval. Shitao Xiao et al. EMNLP 2021 Python
Progressively Optimized Bi-Granular Document Representation for Scalable Embedding Based Retrieval. Shitao Xiao et al. WWW 2022 NA
Asymmetric LSH (ALSH) for Sublinear Time Maximum Inner Product Search (MIPS). Anshumali Shrivastava et al. NeurIPS 2014 NA
ANN-Benchmarks: A Benchmarking Tool for Approximate Nearest Neighbor Algorithms. Martin Aumüller et al. SISAP 2017 NA
Results of the NeurIPS’21 Challenge on Billion-Scale Approximate Nearest Neighbor Search. Harsha Vardhan Simhadri et al. Arxiv 2022 Python
Interpreting Dense Retrieval as Mixture of Topics. Jingtao Zhan et al. Arxiv 2021 NA
The Web Is Your Oyster - Knowledge-Intensive NLP against a Very Large Web Corpus. Aleksandra Piktus et al. CoRR 2021 NA
Bi-Phase Enhanced IVFPQ for Time-Efficient Ad-hoc Retrieval. Peitian Zhang et al. Arxiv 2022 NA

Interaction with Re-ranking

Paper Author Venue Code
RocketQAv2: A Joint Training Method for Dense Passage Retrieval and Passage Re-ranking. Ruiyang Ren et al. EMNLP 2021 Python
Dealing with Typos for BERT-based Passage Retrieval and Ranking. Shengyao Zhuang et al. EMNLP 2021 Python
Trans-Encoder: Unsupervised sentence-pair modelling through self- and mutual-distillations. Fangyu Liu et al. ICLR 2022 Python
Adversarial Retriever-Ranker for dense text retrieval. Hang Zhang et al. Arxiv 2021 NA
Embedding-based Retrieval in Facebook Search. Jui-Ting Huang et al. KDD 2020 NA
Passage Re-ranking With BERT. Rodrigo Nogueira et al. Arxiv 2019 NA
Understanding the Behaviors of BERT in Ranking. Yifan Qiao et al. CoRR 2019 NA
Multi-passage BERT: A Globally Normalized BERT Model for Open-domain Question Answering. Zhiguo Wang et al. Arxiv 2019 NA
Towards Robust Ranker for Text Retrieval. Yucheng Zhou et al. Arxiv 2022 Python
Rethink Training of BERT Rerankers in Multi-Stage Retrieval Pipeline. Luyu Gao et al. ECIR 2021 NA
Multi-Stage Document Ranking with BERT. Rodrigo Nogueira et al. CoRR 2019 NA

Advanced Topics

Zero-shot Dense Retrieval

Paper Author Venue Code
BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models. Nandan Thakur et al. NeurIPS 2021 Python
A Thorough Examination on Zero-shot Dense Retrieval. Ruiyang Ren et al. Arxiv 2022 NA
Challenges in Generalization in Open Domain Question Answering. Linqing Liu et al. NAACL 2022 Python
Zero-shot Neural Passage Retrieval via Domain-targeted Synthetic Question Generation. Ji Ma et al. Arxiv 2021 NA
Efficient Retrieval Optimized Multi-task Learning. Hengxin Fun et al. Arxiv 2021 NA
Zero-Shot Dense Retrieval with Momentum Adversarial Domain Invariant Representations. Ji Xin et al. ACL 2022 NA
Towards Robust Neural Retrieval Models with Synthetic Pre-Training. Revanth Gangi Reddy et al. Arxiv 2021 NA
Embedding-based Zero-shot Retrieval through Query Generation. Davis Liang et al. Arxiv 2020 NA
GPL: Generative Pseudo Labeling for Unsupervised Domain Adaptation of Dense Retrieval. Kexin Wang et al. Arxiv 2021 Python
Salient Phrase Aware Dense Retrieval: Can a Dense Retriever Imitate a Sparse One? Xilun Chen et al. Arxiv 2021 Python
LaPraDoR: Unsupervised Pretrained Dense Retriever for Zero-Shot Text Retrieval. Canwen Xu et al. ACL 2022 Python
Out-of-Domain Semantics to the Rescue! Zero-Shot Hybrid Retrieval Models. Tao Chen et al. ECIR 2022 NA
Towards Unsupervised Dense Information Retrieval with Contrastive Learning. Gautier Izacard et al. Arxiv 2021 NA
Large Dual Encoders Are Generalizable Retrievers. Jianmo Ni et al. Arxiv 2021 NA
KILT: a Benchmark for Knowledge Intensive Language Tasks. Fabio Petroni et al. Arxiv 2020 Python
Promptagator: Few-shot Dense Retrieval From 8 Examples. Zhuyun Dai et al. Arxiv 2022 NA

Improving the Robustness to Query Variations

Paper Author Venue Code
Towards Robust Dense Retrieval via Local Ranking Alignment. Xuanang Chen et al. IJCAI 2022 Python
CharacterBERT and Self-Teaching for Improving the Robustness of Dense Retrievers on Queries with Typos. Shengyao Zhuang et al. SIGIR 2022 Python
Evaluating the Robustness of Retrieval Pipelines with Query Variation Generators. Gustavo Penha et al. ECIR 2022 Python
Retrieval Consistency in the Presence of Query Variations. Peter Bailey et al. SIGIR 2017 NA
Analysing the Robustness of Dual Encoders for Dense Retrieval Against Misspellings. Georgios Sidiropoulos et al. Arxiv 2022 Shell
A Survey of Automatic Query Expansion in Information Retrieval. Claudio Carpineto et al. CSUR 2012 NA
BERT Rankers are Brittle: a Study using Adversarial Document Perturbations. Yumeng Wang et al. SIGIR 2022 Python
Order-Disorder: Imitation Adversarial Attacks for Black-box Neural Ranking Models. Jiawei Liu et al. CoRR 2022 NA

Generative Text Retrieval

Paper Author Venue Code
Transformer Memory as a Differentiable Search Index. Yi Tay et al. Arxiv 2022 NA
DynamicRetriever: A Pre-training Model-based IR System with Neither Sparse nor Dense Index. Yujia Zhou et al. Arxiv 2022 NA
Autoregressive Search Engines: Generating Substrings as Document Identifiers. Michele Bevilacqua et al. Arxiv 2022 Python
Generative Retrieval for Long Sequences. Hyunji Lee et al. Arxiv 2022 NA
GERE: Generative Evidence Retrieval for Fact Verification. Jiangui Chen et al. SIGIR 2022 Python
Autoregressive Entity Retrieval. Nicola De Cao et al. ICLR 2021 Python
Rethinking Search: Making Domain Experts out of Dilettantes. Donald Metzler et al. SIGIR 2021 NA
A Neural Corpus Indexer for Document Retrieval. Yujing Wang et al. CoRR 2022 NA
Bridging the Gap Between Indexing and Retrieval for Differentiable Search Index with Query Generation. Shengyao Zhuang et al. CoRR 2022 Python
Ultron: An Ultimate Retriever on Corpus with a Model-based Indexer. Yujia Zhou et al. CoRR 2022 NA
CorpusBrain: Pre-train a Generative Retrieval Model for Knowledge-Intensive Language Tasks. Jiangui Chen et al. Arxiv 2022 Python

Retrieval-Augmented Language Model

Paper Author Venue Code
Generalization through memorization: Nearest neighbor language models. Urvashi Khandelwal et al. Arxiv 2020 Python
Adaptive semiparametric language models. Dani Yogatama et al. TACL 2021 NA
Improving language models by retrieving from trillions of tokens. Sebastian Borgeaud et al. Arxiv 2021 NA
REALM: Retrieval-Augmented Language Model Pre-Training. Kelvin Guu et al. ICML 2020 Python
Simple and Efficient ways to Improve REALM. Vidhisha Balachandran et al. Arxiv 2021 NA
Efficient Nearest Neighbor Language Models. Junxian He et al. EMNLP 2021 Python

Applications

Information Retrieval Applications

Paper Author Venue Code
Multi-modal Retrieval of Tables and Texts Using Tri-encoder Models. Bogdan Kostic et al. Arxiv 2021 NA
Open Domain Question Answering over Tables via Dense Retrieval. Jonathan Herzig et al. NAACL 2021 Python
SituatedQA: Incorporating Extra-Linguistic Contexts into QA. Michael J.Q. Zhang et al. EMNLP 2021 DATA
XOR QA: Cross-lingual Open-Retrieval Question Answering. Akari Asai et al. NAACL 2021 Python
One Question Answering Model for Many Languages with Cross-lingual Dense Passage Retrieval. Akari Asai et al. NeurIPS 2021 Python
Evaluating Token-Level and Passage-Level Dense Retrieval Models for Math Information Retrieval. Wei Zhong et al. Arxiv 2022 Python
ReACC: A Retrieval-Augmented Code Completion Framework. Shuai Lu et al. ACL 2022 Python
Improving Biomedical Information Retrieval with Neural Retrievers. Man Luo et al. AAAI 2022 NA

Natural Language Processing Applications

Paper Author Venue Code
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. Patrick Lewis et al. Arxiv 2020 NA
Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering. Gautier Izacard et al. EACL 2021 Python
End-to-End Training of Neural Retrievers for Open-Domain Question Answering. Devendra Singh Sachan et al. ACL 2021 Python
Baleen: Robust Multi-Hop Reasoning at Scale via Condensed Retrieval. Omar Khattab et al. NeurIPS 2021 Python
Answering Complex Open-domain Questions with Multi-hop Dense Retrieval. Wenhan Xiong et al. ICLR 2021 Python
Learning Dense Representations for Entity Retrieval. Daniel Gillick et al. CoNLL 2019 NA
Scalable Zero-shot Entity Linking with Dense Entity Retrieval. Ledell Wu et al. EMNLP 2020 Python
Zero-Shot Entity Linking by Reading Entity Descriptions. Lajanugen Logeswaran et al. ACL 2019 Python
Retrieval Augmentation Reduces Hallucination in Conversation. Kurt Shuster et al. EMNLP 2021 NA
Internet-Augmented Dialogue Generation. Mojtaba Komeili et al. ACL 2022 NA
LaMDA: Language Models for Dialog Applications. Romal Thoppilan et al. Arxiv 2022 NA

Industrial Practice

Paper Author Venue Code
Pre-trained Language Model for Web-scale Retrieval in Baidu Search. Yiding Liu et al. KDD 2021 NA
MOBIUS: Towards the Next Generation of Query-Ad Matching in Baidu’s Sponsored Search. Miao Fan. KDD 2019 NA
Uni-Retriever: Towards Learning The Unified Embedding Based Retriever in Bing Sponsored Search. Jianjin Zhang et al. Arxiv 2022 NA
Embedding-based Product Retrieval in Taobao Search. Sen Li et al. KDD 2021 NA
Que2Search: Fast and Accurate Query and Document Understanding for Search at Facebook. Yiqun Liu et al. KDD 2021 NA
DiskANN: Fast Accurate Billion-point Nearest Neighbor Search on a Single Node. Suhas Jayaram Subramanya et al. NeurIPS 2019 Python
SPANN: Highly-efficient Billion-scale Approximate Nearest Neighbor Search. Qi Chen et al. NeurIPS 2021 Python
HEARTS: Multi-task Fusion of Dense Retrieval and Non-autoregressive Generation for Sponsored Search. Bhargav Dodla et al. Arxiv 2022 NA
Sponsored Search Auctions: Recent Advances and Future Directions. Tao Qin et al. TIST 2015 NA
Semantic Retrieval at Walmart. Alessandro Magnani et al. KDD 2022 NA

Datasets

Paper Author Venue Link
BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models. Nandan Thakur et al. NeurIPS 2021 DATA
MS MARCO: A Human Generated MAchine Reading COmprehension Dataset. Payal Bajaj et al. NeurIPS 2016 DATA
Natural Questions: a Benchmark for Question Answering Research. Tom Kwiatkowski et al. TACL 2019 DATA
TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. Mandar Joshi et al. ACL 2017 DATA
mMARCO: A Multilingual Version of the MS MARCO Passage Ranking Dataset. Luiz Henrique Bonifacio et al. Arxiv 2021 DATA
TREC 2019 News Track Overview. Ian Soboroff et al. TREC 2019 DATA
TREC-COVID: rationale and structure of an information retrieval shared task for COVID-19. Kirk Roberts et al. J Am Med Inform Assoc. 2020 DATA
A Full-Text Learning to Rank Dataset for Medical Information Retrieval. Vera Boteva et al. ECIR 2016 DATA
A Data Collection for Evaluating the Retrieval of Related Tweets to News Articles. Axel Suarez et al. ECIR 2018 DATA
Overview of Touché 2020: Argument Retrieval. Alexander Bondarenko et al. CLEF 2020 DATA
Retrieval of the Best Counterargument without Prior Topic Knowledge. Henning Wachsmuth et al. ACL 2018 DATA
DBpedia-Entity v2: A Test Collection for Entity Search. Faegheh Hasibi et al. SIGIR 2017 DATA
ORCAS: 20 Million Clicked Query-Document Pairs for Analyzing Search. Nick Craswell et al. CIKM 2020 DATA
TREC 2022 Deep Learning Track Guidelines. Nick Craswell et al. TREC 2021 DATA
DuReader_retrieval: A Large-scale Chinese Benchmark for Passage Retrieval from Web Search Engine. Yifu Qiu et al. Arxiv 2022 DATA
SQuAD: 100,000+ Questions for Machine Comprehension of Text. Pranav Rajpurkar et al. EMNLP 2016 DATA
HOTPOTQA: A Dataset for Diverse, Explainable Multi-hop Question Answering. Zhilin Yang et al. EMNLP 2018 DATA
Semantic Parsing on Freebase from Question-Answer Pairs. Jonathan Berant et al. EMNLP 2013 DATA
Modeling of the Question Answering Task in the YodaQA System. Petr Baudiš et al. CLEF 2015 DATA
WWW'18 Open Challenge: Financial Opinion Mining and Question Answering. Macedo Maia et al. WWW 2018 DATA
An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition. George Tsatsaronis et al. BMC Bioinform. 2015 DATA
CQADupStack: A Benchmark Data Set for Community Question-Answering Research. Doris Hoogeveen et al. ADCS 2015 DATA
First Quora Dataset Release: Question Pairs. Shankar Iyer et al. Webpage DATA
CCQA: A New Web-Scale Question Answering Dataset for Model Pre-Training. Patrick Huber et al. NAACL 2022 DATA
FEVER: a Large-scale Dataset for Fact Extraction and VERification. James Thorne et al. NAACL 2018 DATA
CLIMATE-FEVER: A Dataset for Verification of Real-World Climate Claims. Thomas Diggelmann et al. NeurIPS 2020 DATA
Fact or Fiction: Verifying Scientific Claims. David Wadden et al. EMNLP 2020 DATA
SPECTER: Document-level Representation Learning using Citation-informed Transformers. Arman Cohan et al. ACL 2020 DATA
Simple Entity-Centric Questions Challenge Dense Retrievers. Christopher Sciavolino et al. EMNLP 2021 DATA
ArchivalQA: A Large-scale Benchmark Dataset for Open Domain Question Answering over Archival News Collections. Jiexin Wang et al. Arxiv 2021 NA
Multi-CPR: A Multi Domain Chinese Dataset for Passage Retrieval. Dingkun Long et al. SIGIR 2022 DATA
HOVER: A Dataset for Many-Hop Fact Extraction And Claim Verification. Yichen Jiang et al. EMNLP 2020 DATA
TREC 2021 Deep Learning Track Guidelines. Nick Craswell et al. NA NA
MSMarco Chameleons: Challenging the MSMarco Leaderboard with Extremely Obstinate Queries. Negar Arabzadeh et al. CIKM 2021 Roff

Libraries

Paper Author Venue Code
RocketQA --- webpage Python
Billion-scale similarity search with GPUs. Jeff Johnson et al. TBD 2019 Python
Pyserini: An Easy-to-Use Python Toolkit to Support Replicable IR Research with Sparse and Dense Representations. Jimmy Lin et al. Arxiv 2021 Python
MatchZoo: A Learning, Practicing, and Developing System for Neural Text Matching. Jiafeng Guo et al. SIGIR 2019 Python
Anserini: Enabling the Use of Lucene for Information Retrieval Research. Peilin Yang et al. SIGIR 2017 Java
Tevatron: An Efficient and Flexible Toolkit for Dense Retrieval. Luyu Gao et al. Arxiv 2022 Python
Asyncval: A Toolkit for Asynchronously Validating Dense Retriever Checkpoints during Training. Shengyao Zhuang et al. SIGIR 2022 Python
Pyserini: A Python Toolkit for Reproducible Information Retrieval Research with Sparse and Dense Representations. Jimmy Lin et al. SIGIR 2021 NA
OpenMatch: An Open Source Library for Neu-IR Research. Zhenghao Liu et al. SIGIR 2021 Python
SentEval: An Evaluation Toolkit for Universal Sentence Representations. Alexis Conneau et al. Arxiv 2018 Python
