
Context-aware Neural Information Retrieval

Introduction

PyTorch code for our ICLR 2018 and SIGIR 2019 papers.

The codebase contains the source code of eight document ranking models, three query suggestion models, and three multi-task context-aware ranking and suggestion models.

Document Ranking Models
Query Suggestion Models

Please note that we provide a simplified implementation of ACG.

Multi-task Learning Models

Requirements

Training/Testing Models

$ cd scripts
$ bash SCRIPT_NAME GPU_ID MODEL_NAME
  • To train/test document ranking models, use ranker.sh in place of SCRIPT_NAME
  • To train/test query suggestion models, use recommender.sh in place of SCRIPT_NAME
  • To train/test multitask models, use multitask.sh in place of SCRIPT_NAME

Here is the list of models that you can use in place of MODEL_NAME.

  • Document Ranking Models: esm, dssm, cdssm, drmm, arci, arcii, duet, match_tensor
  • Query Suggestion Models: seq2seq, hredqs, acg
  • Multitask Models: mnsrf, m_match_tensor, cars

For example, if you want to run our CARS model, run the following command.

$ bash multitask.sh GPU_ID cars

Running experiments on CPU/GPU/Multi-GPU
  • If GPU_ID is set to -1, the CPU will be used.
  • If GPU_ID is set to a single number, only that GPU will be used.
  • If GPU_ID is set to multiple comma-separated numbers (e.g., 0,1,2), multiple GPUs will be used in parallel.
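The GPU_ID convention above can be summarized in a small helper. This is a minimal sketch of how the argument might be interpreted, not the repository's actual argument parsing; `parse_gpu_id` is a hypothetical name.

```python
def parse_gpu_id(gpu_id: str):
    """Interpret the GPU_ID command-line argument.

    Returns (use_cuda, device_ids): -1 means CPU only, a single
    number selects one GPU, and comma-separated numbers enable
    multi-GPU parallel computing.
    """
    ids = [int(x) for x in gpu_id.split(",")]
    if ids == [-1]:
        return False, []   # run on CPU
    return True, ids       # one or more GPU device ids

# The three cases from the bullets above:
print(parse_gpu_id("-1"))     # CPU
print(parse_gpu_id("0"))      # single GPU
print(parse_gpu_id("0,1,2"))  # multi-GPU
```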

An Artificial Dataset

We are unable to make our experimental dataset publicly available. However, we provide scripts to create an artificial dataset from the MSMARCO Q&A v2.1 and MSMARCO Conversational Search datasets. Please run the script from the /data/msmarco/ directory. Once the data is generated, you should see a table with the following statistics.

Attribute             Train     Dev       Test
Sessions              223876    24832     27673
Queries               1530546   169413    189095
Avg Session Len       6.84      6.82      6.83
Avg Query Len         3.84      3.85      3.84
Max Query Len         40        32        32
Avg Doc Len           63.41     63.43     63.48
Max Doc Len           290       290       290
Avg Click Per Query   1.05      1.05      1.05
Max Click Per Query   6         6         6
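As a quick sanity check, the Avg Session Len row is simply Queries divided by Sessions for each split:

```python
# Derived statistic: average session length = queries / sessions.
splits = {
    "Train": (223876, 1530546),
    "Dev": (24832, 169413),
    "Test": (27673, 189095),
}
for name, (sessions, queries) in splits.items():
    print(f"{name}: {queries / sessions:.2f}")
# Train: 6.84, Dev: 6.82, Test: 6.83 -- matching the table.
```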

Results on the Artificial Dataset

Coming soon!

Acknowledgement

I borrowed and modified code from DrQA and OpenNMT, and I would like to express my gratitude to the authors of these repositories.

Citation

If you find the resources in this repo useful, please cite our papers.

@inproceedings{Ahmad:2019:CAD:3331184.3331246,
 author = {Ahmad, Wasi Uddin and Chang, Kai-Wei and Wang, Hongning},
 title = {Context Attentive Document Ranking and Query Suggestion},
 booktitle = {Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval},
 year = {2019},
 pages = {385--394}
} 
@inproceedings{uddin2018multitask,
 title={Multi-Task Learning for Document Ranking and Query Suggestion},
 author={Wasi Uddin Ahmad and Kai-Wei Chang and Hongning Wang},
 booktitle={International Conference on Learning Representations},
 year={2018}
}
