OCNLI: Original Chinese Natural Language Inference

OCNLI stands for Original Chinese Natural Language Inference. It is a corpus for Chinese natural language inference, collected by closely following the procedures of MNLI but with enhanced strategies aimed at more challenging inference pairs. We want to emphasize that no human or machine translation was used in creating the dataset: our Chinese texts are original, not translated.

OCNLI has roughly 50k pairs for training, 3k for development and 3k for test. We release the test data but withhold its labels. See our paper for details.

OCNLI is the first large-scale Chinese natural language inference dataset built from native, non-translated Chinese text. It is part of the CLUE benchmark.

Data format

Our dataset is distributed in JSON format. Here's an example from OCNLI.dev:

{
  "level": "medium",
  "sentence1": "身上裹一件工厂发的棉大衣,手插在袖筒里",
  "sentence2": "身上至少一件衣服",
  "label": "entailment",
  "label0": "entailment", "label1": "entailment", "label2": "entailment",
  "label3": "entailment", "label4": "entailment",
  "genre": "lit", "prem_id": "lit_635", "id": 0
}

where:

  • level: easy, medium and hard refer to the first, second and third hypothesis the annotator wrote for a given premise and label; we expect difficulty to increase across the three. See our paper for details.
  • sentence1: the premise sentence(s).
  • sentence2: the hypothesis sentence(s).
  • label: the majority vote of label0 -- label4. If there is no majority agreement, the label is "-" and the example should be excluded from experiments, as in SNLI and MNLI (already taken care of in our baseline code).
  • label0 -- label4: the 5 annotated labels for the NLI pair. All pairs in dev and test have 5 labels, whereas only a small portion of the training set does.
  • genre: one of gov (government documents), news, lit (literature), tv (TV talk-show transcripts) and phone (telephone conversation transcripts).
  • prem_id: id of the premise.
  • id: overall id.

You only need sentence1, sentence2 and label to train and evaluate.
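For concreteness, here is a minimal loading sketch. It assumes each split is a JSON-lines file (one object per line, as in the example above); the paths are placeholders:

import json
from collections import Counter

def load_ocnli(path):
    """Load an OCNLI split, keeping only sentence1, sentence2 and label.

    Examples whose gold label is '-' (no majority among the 5 annotators)
    are dropped, as in SNLI/MNLI.
    """
    pairs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            ex = json.loads(line)
            if ex.get("label") == "-":
                continue
            pairs.append((ex["sentence1"], ex["sentence2"], ex.get("label")))
    return pairs

def majority_label(ex):
    """Recompute the majority vote from label0 -- label4 (dev/test only).

    Returns '-' when fewer than 3 of the 5 annotators agree.
    """
    votes = [ex[f"label{i}"] for i in range(5)]
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= 3 else "-"

# e.g. dev = load_ocnli("data/OCNLI.dev")  # path is a placeholder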

Data split

To study how training-set size affects results, we provide four training sets; OCNLI.train.3k, OCNLI.train.10k and OCNLI.train.30k are all subsets of OCNLI.train.50k:

  1. OCNLI.train.50k: 50k data points (OCNLI.train in our paper)
  2. OCNLI.train.30k: filtered subset of OCNLI.train.50k with 30k data points (OCNLI.train.small in our paper)
  3. OCNLI.train.10k: 10k data points sampled from OCNLI.train.30k
  4. OCNLI.train.3k: 3k data points sampled from OCNLI.train.30k

We wanted to see the effect of training size and of overlapping premises on the results. Results for the first two training sets are reported in our paper, along with details of the splits; the last two sets are intended to mimic situations where annotated data is limited.

All four training sets are evaluated on the same dev and test sets, OCNLI.dev and OCNLI.test.
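If you want a quick feel for the size effect before downloading all four files, a smaller split can be approximated by sampling the 50k file. This is only an illustration: the official OCNLI.train.* subsets were constructed with their own filtering procedure, so results will not match exactly.

import random

def sample_subset(src_path, dst_path, n, seed=0):
    """Write a random n-line subset of a JSON-lines training file.

    Illustration only: results on an ad-hoc sample will not match those
    reported for the official OCNLI.train.10k / OCNLI.train.3k files.
    """
    with open(src_path, encoding="utf-8") as f:
        lines = f.readlines()
    random.Random(seed).shuffle(lines)
    with open(dst_path, "w", encoding="utf-8") as f:
        f.writelines(lines[:n])

# e.g. sample_subset("data/OCNLI.train.50k", "data/train.3k.sample", 3000)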

Leaderboard

OCNLI is part of the CLUE benchmark, which hosts a leaderboard where you can submit your results on the test set.

At present, you can submit test results from models trained on OCNLI.train.50k and OCNLI.train.30k.

Note on the submission format: upload a single zip archive containing the following files: OCNLI_50k.json, OCNLI_30k.json.
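A minimal packaging sketch is below. The per-file prediction schema used here (one {"id": ..., "label": ...} object per line) is an assumption; check the CLUE submission page for the exact format it expects:

import json
import zipfile

def write_predictions(path, ids, labels):
    # One JSON object per test example; the field names "id" and "label"
    # are an assumption -- follow the schema required by the CLUE site.
    with open(path, "w", encoding="utf-8") as f:
        for i, lab in zip(ids, labels):
            f.write(json.dumps({"id": i, "label": lab}, ensure_ascii=False) + "\n")

# Package the two required (already written) files into one zip for upload.
with zipfile.ZipFile("submission.zip", "w") as z:
    z.write("OCNLI_50k.json")
    z.write("OCNLI_30k.json")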

Baselines

Models

Please refer to https://github.com/CLUEbenchmark/OCNLI/blob/main/rep_baseline.md

Results

  • trained with OCNLI.train = 50k data points

Accuracy (%) on the dev / test sets: mean accuracy across 5 runs (standard deviation in parentheses). BERT: BERT_base; RoBERTa: RoBERTa_large_wwm. See our paper for more details.

| eval set | majority | CBOW | BiLSTM | ESIM | BERT | RoBERTa | human |
|---|---|---|---|---|---|---|---|
| dev | 37.4 | 56.8 (0.4) | 60.5 (0.4) | 61.8 (0.5) | 74.5 (0.3) | 78.8 (1.0) | na |
| test | 38.1 | 55.7 (0.5) | 59.2 (0.5) | 59.8 (0.4) | 72.2 (0.7) | 78.2 (0.7) | 90.3 |
  • trained with OCNLI.train.small = 30k data points

| eval set | BiLSTM | BERT | RoBERTa | human |
|---|---|---|---|---|
| dev | 58.7 (0.3) | 72.6 (0.9) | 77.4 (1.0) | na |
| test | 57.0 (0.9) | 70.3 (0.9) | 76.4 (1.2) | 90.3 |
  • trained with OCNLI.train.10k = 10k data points

| eval set | BERT | RoBERTa | human |
|---|---|---|---|
| dev | 69.2 (0.5) | 75.2 (0.3) | na |
| test | 67.0 (0.6) | 73.6 (0.5) | 90.3 |
  • trained with OCNLI.train.3k = 3k data points

| eval set | BERT | RoBERTa | human |
|---|---|---|---|
| dev | 64.4 (0.7) | 70.4 (0.6) | na |
| test | 62.8 (0.7) | 69.5 (0.5) | 90.3 |
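All numbers above are "mean (standard deviation)" over 5 runs. To report your own runs in the same format, a helper like the one below works (the accuracies in the example are made up):

import statistics

def report(accs):
    """Format per-run accuracies as 'mean (std)', as in the tables above."""
    return f"{statistics.mean(accs):.1f} ({statistics.stdev(accs):.1f})"

print(report([74.2, 74.7, 74.4, 74.8, 74.3]))  # -> 74.5 (0.3)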

More details about OCNLI

  • OCNLI is collected using an enhanced SNLI/MNLI procedure, where the biggest difference is that annotators were instructed to write 3 hypotheses per label per premise, instead of 1. That is, for a given premise an annotator produces 3 entailed hypotheses, 3 neutral ones and 3 contradictions. We believe this forces the annotators to produce more challenging hypotheses. At the time of publication (October 2020), there is a gap of roughly 12% between human performance and that of the best model.

  • Our premises come from 5 genres: government documents, news, literature, TV talk-show transcripts, and telephone conversation transcripts.

  • Similar to SNLI/MNLI, we selected a portion of the collected premise-hypothesis pairs for relabelling as a sanity check, and all our dev and test data received at least 3 out of 5 majority votes. The 'bad' examples have the label '-' and should be excluded from experiments. Our annotator agreement is slightly better than SNLI/MNLI in 2 of the 4 annotation conditions and similar to SNLI/MNLI in the other 2.

  • We believe our dataset is challenging and of high quality. This is due in no small part to our annotators, who are undergraduate students studying language-related subjects at Chinese universities, rather than crowd workers; we thank them for their hard work.

  • Example pairs from OCNLI:

| sentence1 | sentence2 | source | label |
|---|---|---|---|
| 但是不光是中国,日本,整个东亚文化都有这个特点就是被权力影响很深 | 有超过两个东亚国家有这个特点 | OCNLI | E |
| 完善加工贸易政策体 | 贸易政策体系还有不足之处 | OCNLI | E |
| 咖啡馆里面对面坐的年轻男女也是上一代的故事,她已是过来人了 | 男人和女人是背对背坐着的 | OCNLI | C |
| 今天,这一受人关注的会议终于在波恩举行 | 这一会议原定于昨天举行 | OCNLI | N |
| 嗯,今天星期六我们这儿,嗯哼. | 昨天是星期天 | OCNLI | C |

Why not XNLI?

While XNLI has been helpful for multilingual NLI research, the quality of the XNLI Chinese data is far from satisfactory; here are just a few of the bad examples we found when annotating 300 randomly sampled examples from the XNLI dev set:

| sentence1 | sentence2 | source | label |
|---|---|---|---|
| Louisa May Alcott和Nathaniel Hawthorne 住在Pinckney街道,而 那个被Oliver Wendell Holmes称为 "晴天街道 的Beacon Street街道住着有些喜欢自吹自擂的历史学家 William Prescott | Hawthorne住在Main Street上 | XNLI dev | C |
| 看看东方的Passeig de Gracia,特别是Diputacie,Consell de Cent,Mallorca和Valancia,直到Mercat de la Concepcie市场 | 市场出售大量的水果和蔬菜 | XNLI dev | N |
| Leisure Modern medicine and hygiene学说已经解决了过去占据我们免疫系统的大部分问题 | 人类是唯一没有免疫系统的生物 | XNLI dev | C |
| 政府,法律的batta, begar, chaprasi, dakoit, dakoity, dhan, dharna, kotwal, kotwali, panchayat, pottah, sabha | 所有的单词都很容易理解 | XNLI dev | C |
| 下一阶段,中情局基地组织的负责人当时回忆说,他不认为他的职责是指导应该做什么或不应该做什么 | 导演认为这完全取决于他 | XNLI dev | C |

Related resources

  • CLUE: Chinese Language Understanding Evaluation benchmark
  • SNLI: Stanford NLI corpus
  • MNLI: Multi-genre NLI corpus
  • XNLI: Cross-Lingual NLI corpus
  • ANLI: Adversarial NLI corpus

TODO

  • set up submission of test results on CLUE
  • code for baseline models in Huggingface [Feel free to make a PR]

Contributors

Hai Hu, Kyle Richardson, Liang Xu, Lu Li, Sandra Kuebler and Larry Moss

Acknowledgements

We greatly appreciate the hard work of our annotators, who are from the following universities: Xiamen University, Beijing Foreign Studies University, University of Electronic Science and Technology of China, and Beijing Normal University. We also want to thank Ruoze Huang, Zhaohong Wu, Jueyan Wu and Xiaojie Gong for helping us to find the annotators. This project is funded by Grant-in-Aid of Doctoral Research from Indiana University Graduate School and the CLUE benchmark.

License

  • Attribution-NonCommercial 2.0 Generic (CC BY-NC 2.0)
  • The premises in the news genre are sampled from the LCMC corpus (ISLRN ID: 990-638-120-277-2, ELRA reference: ELRA-W0039), with permission from ELRA.

Citation

Please cite the following paper if you use OCNLI in your research:

@inproceedings{ocnli,
	title={OCNLI: Original Chinese Natural Language Inference},
	author={Hai Hu and Kyle Richardson and Liang Xu and Lu Li and Sandra Kuebler and Larry Moss},
	booktitle={Findings of EMNLP},
	year={2020},
	url={https://arxiv.org/abs/2010.05444}
}
