Multi-CPR: A Multi Domain Chinese Dataset for Passage Retrieval

This repo contains the annotated datasets and the experiment implementations introduced in our SIGIR 2022 resource paper, Multi-CPR: A Multi Domain Chinese Dataset for Passage Retrieval. [Paper]

📢 What's New

  • 🌟 2023-01: Multiple models fine-tuned on the Multi-CPR dataset have been open-sourced on the ModelScope platform. See Released Models.

Introduction

Multi-CPR is a multi-domain Chinese dataset for passage retrieval. The dataset is collected from three different domains: E-commerce, Entertainment video, and Medical. Each domain contains millions of passages and a set of human-annotated query-passage relevance pairs.

Examples of annotated query-passage relevance pairs in the three domains:

| Domain | Query | Passage |
| --- | --- | --- |
| E-commerce | 尼康z62 (Nikon Z62) | Nikon/尼康二代全画幅微单机身Z62 Z72 24-70mm套机 (Nikon/Nikon II full-frame mirrorless camera body, Z62/Z72 with 24-70mm lens kit) |
| Entertainment video | 海神妈祖 (Ma-tsu, Goddess of the Sea) | 海上女神妈祖 (Ma-tsu, Goddess of the Sea) |
| Medical | 大人能把手放在睡觉婴儿胸口吗 (Can adults put their hands on the chest of a sleeping baby?) | 大人不能把手放在睡觉婴儿胸口,对孩子呼吸不好,要注意 (Adults should not put their hands on the chest of a sleeping baby, as this is not good for the baby's breathing.) |

Data Format

The datasets of each domain share a uniform format; more details can be found in our paper. A minimal loading sketch follows the file table below.

  • qid: A unique id for each query that is used in evaluation
  • pid: A unique id for each passage that is used in evaluation
| File name | Number of records | Format |
| --- | --- | --- |
| corpus.tsv | 1002822 | pid, passage content |
| train.query.txt | 100000 | qid, query content |
| dev.query.txt | 1000 | qid, query content |
| qrels.train.tsv | 100000 | qid, '0', pid, '1' |
| qrels.dev.tsv | 1000 | qid, '0', pid, '1' |
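
As a rough illustration of this format, the sketch below shows one way to load the corpus, query, and qrels files with plain Python. The directory layout and file paths in the usage comments are placeholders, and the fields are assumed to be tab-separated as the .tsv extension suggests.

```python
def load_tsv(path):
    """Load a two-column TSV file (e.g. corpus.tsv or dev.query.txt) into an id -> text dict."""
    data = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            key, value = line.rstrip("\n").split("\t", 1)
            data[key] = value
    return data


def load_qrels(path):
    """Load qrels rows of the form qid, '0', pid, '1' into a qid -> {pid, ...} dict."""
    qrels = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            qid, _, pid, _ = line.rstrip("\n").split("\t")
            qrels.setdefault(qid, set()).add(pid)
    return qrels


# Hypothetical usage for one domain folder (paths are placeholders):
# corpus = load_tsv("ecom/corpus.tsv")       # pid -> passage text
# queries = load_tsv("ecom/dev.query.txt")   # qid -> query text
# qrels = load_qrels("ecom/qrels.dev.tsv")   # qid -> set of relevant pids
```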

Experiments

The retrieval and rerank folders describe how to train a BERT-base dense passage retrieval model and a reranking model on the Multi-CPR dataset. The code is based on the earlier tevatron and reranker projects by luyug. Many thanks to luyug.

Dense Retrieval Results

| Models | Datasets | Encoder | E-commerce (MRR@10 / Recall@1000) | Entertainment video (MRR@10 / Recall@1000) | Medical (MRR@10 / Recall@1000) |
| --- | --- | --- | --- | --- | --- |
| DPR | General | BERT | 0.2106 / 0.7750 | 0.1950 / 0.7710 | 0.2133 / 0.5220 |
| DPR-1 | In-domain | BERT | 0.2704 / 0.9210 | 0.2537 / 0.9340 | 0.3270 / 0.7470 |
| DPR-2 | In-domain | BERT-CT | 0.2894 / 0.9260 | 0.2627 / 0.9350 | 0.3388 / 0.7690 |
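
For reference, the sketch below shows one straightforward way to compute the MRR@10 and Recall@1000 numbers reported above from a ranked run and the qrels. The function names and the run dictionary layout are illustrative, not part of the repo's evaluation scripts.

```python
def mrr_at_k(run, qrels, k=10):
    """Mean Reciprocal Rank at cutoff k.

    run:   {qid: [pid, pid, ...]}  ranked passage ids per query
    qrels: {qid: {pid, ...}}       relevant passage ids per query
    """
    total = 0.0
    for qid, ranking in run.items():
        relevant = qrels.get(qid, set())
        for rank, pid in enumerate(ranking[:k], start=1):
            if pid in relevant:
                total += 1.0 / rank
                break
    return total / len(run)


def recall_at_k(run, qrels, k=1000):
    """Recall at cutoff k, averaged over queries."""
    total = 0.0
    for qid, ranking in run.items():
        relevant = qrels.get(qid, set())
        if relevant:
            total += len(relevant & set(ranking[:k])) / len(relevant)
    return total / len(run)
```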

BERT Reranking Results

| Retrieval | Reranker | E-commerce (MRR@10) | Entertainment video (MRR@10) | Medical (MRR@10) |
| --- | --- | --- | --- | --- |
| DPR-1 | - | 0.2704 | 0.2537 | 0.3270 |
| DPR-1 | BERT | 0.3624 | 0.3772 | 0.3885 |

Requirements

python=3.8
transformers==4.18.0
tqdm==4.49.0
datasets==1.11.0
torch==1.11.0
faiss==1.7.0

Released Models

We have uploaded some checkpoints fine-tuned on Multi-CPR to the ModelScope model hub. Note that the open-source models on ModelScope are fine-tuned from the ROM or CoROM model rather than from the original BERT model. ROM is a pre-trained language model designed specifically for the dense passage retrieval task. For more details about the ROM model, please refer to the ROM paper.

| Model Type | Domain | Description | Link |
| --- | --- | --- | --- |
| Retrieval | General | Chinese general-domain text representation model (retrieval stage) | nlp_corom_sentence-embedding_chinese-base |
| Retrieval | E-commerce | Chinese E-commerce-domain text representation model (retrieval stage) | nlp_corom_sentence-embedding_chinese-base-ecom |
| Retrieval | Medical | Chinese medical-domain text representation model (retrieval stage) | nlp_corom_sentence-embedding_chinese-base-medical |
| ReRanking | General | Chinese general-domain semantic relevance model (reranking stage) | nlp_rom_passage-ranking_chinese-base |
| ReRanking | E-commerce | Chinese E-commerce-domain semantic relevance model (reranking stage) | nlp_corom_passage-ranking_chinese-base-ecom |
| ReRanking | Medical | Chinese medical-domain semantic relevance model (reranking stage) | nlp_corom_passage-ranking_chinese-base-medical |
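
As a rough illustration of how the released retrieval checkpoints can be used, the sketch below queries one of them through the ModelScope pipeline API. The model id and the input schema follow the ModelScope model cards as best we recall them and should be verified against the hub before use.

```python
# Minimal sketch (not from this repo): scoring passages against a query with one
# of the released CoROM retrieval models via ModelScope.
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

embedder = pipeline(
    task=Tasks.sentence_embedding,
    model="damo/nlp_corom_sentence-embedding_chinese-base",  # model id per the ModelScope card
)

inputs = {
    "source_sentence": ["尼康z62"],          # the query
    "sentences_to_compare": [                # candidate passages
        "Nikon/尼康二代全画幅微单机身Z62 Z72 24-70mm套机",
        "海上女神妈祖",
    ],
}

result = embedder(input=inputs)
# The output typically contains the sentence embeddings and query-passage
# similarity scores; see the model card for the exact fields.
print(result)
```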

Citing us

If you find the datasets helpful, please cite:

@inproceedings{Long2022MultiCPRAM,
  author    = {Dingkun Long and Qiong Gao and Kuan Zou and Guangwei Xu and Pengjun Xie and Ruijie Guo and Jian Xu and Guanjun Jiang and Luxi Xing and Ping Yang},
  title     = {Multi-CPR: {A} Multi Domain Chinese Dataset for Passage Retrieval},
  booktitle = {{SIGIR}},
  pages     = {3046--3056},
  publisher = {{ACM}},
  year      = {2022}
}
