
ConvLab-2


ConvLab-2 is an open-source toolkit that enables researchers to build task-oriented dialogue systems with state-of-the-art models, perform end-to-end evaluation, and diagnose the weaknesses of systems. As the successor of ConvLab, ConvLab-2 inherits ConvLab's framework but integrates more powerful dialogue models and supports more datasets. In addition, we have developed an analysis tool and an interactive tool to assist researchers in diagnosing dialogue systems. [paper]

Updates

2022.11.30:

  • ConvLab-3 [paper] released! Building dialogue systems on custom datasets is easier now. Most of ConvLab-2 is retained. New features include:
    • We proposed a unified format for TOD datasets, transformed many commonly used datasets, and adapted models to support the unified format, facilitating research involving many datasets.
    • We added powerful transformer-based models for every module, including two transferable user simulators which can be used for custom datasets.
    • We advanced the RL toolkit. We simplified the process of building the dialogue system and its RL environment, provided plotting tools to compare policies, and offered a wide range of evaluation metrics.

2022.11.14:

  • Due to a potential security risk, the trained models of ConvLab-2 hosted on Azure cannot currently be accessed. We have therefore copied these models to Hugging Face and replaced the model URLs in the ConvLab-2 code with the URLs in our Hugging Face repo. If you use the trained models of ConvLab-2, make sure to update your code.

2021.9.13:

  • Add the MultiWOZ 2.3 dataset under the data dir. The dataset adds co-reference annotations in addition to corrections of dialogue acts and dialogue states. [paper]

2021.6.18:

  • Add LAUG, an open-source toolkit for Language understanding AUGmentation. It provides automatic methods for approximating natural perturbations of existing data; the augmented data can be used for black-box robustness testing or to enhance training. [paper]
  • Add SC-GPT for NLG. [paper]

Installation

Requires Python >= 3.6.

Clone this repository:

git clone https://github.com/thu-coai/ConvLab-2.git

Install ConvLab-2 via pip:

cd ConvLab-2
pip install -e .
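
To verify the installation, a quick import check (a trivial sanity test, not part of the original instructions):

python -c "import convlab2"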

Tutorials

Documents

Our documentation is available at https://thu-coai.github.io/ConvLab-2_docs/convlab2.html.

Models

We provide the following models:

  • NLU: SVMNLU, MILU, BERTNLU
  • DST: rule, TRADE, SUMBT
  • Policy: rule, Imitation, REINFORCE, PPO, GDPL, MDRG, HDSA, LaRL
  • Simulator policy: Agenda, VHUS
  • NLG: Template, SCLSTM
  • End2End: Sequicity, DAMD, RNN_rollout

For more details about these models, refer to the README.md under the convlab2/$module/$model/$dataset directory, e.g., convlab2/nlu/jointBERT/multiwoz/README.md.
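
As an illustration of how these modules fit together, below is a minimal pipeline sketch in the spirit of the toolkit's tutorial (import paths follow the repository layout above; constructing the modules may download pre-trained weights):

from convlab2.nlu.jointBERT.multiwoz import BERTNLU
from convlab2.dst.rule.multiwoz import RuleDST
from convlab2.policy.rule.multiwoz import RulePolicy
from convlab2.nlg.template.multiwoz import TemplateNLG
from convlab2.dialog_agent import PipelineAgent

# chain NLU -> DST -> Policy -> NLG into a single system-side agent
sys_nlu = BERTNLU()                   # parse user utterances into dialogue acts
sys_dst = RuleDST()                   # track the dialogue state with rules
sys_policy = RulePolicy()             # choose system dialogue acts
sys_nlg = TemplateNLG(is_user=False)  # realize dialogue acts as text
sys_agent = PipelineAgent(sys_nlu, sys_dst, sys_policy, sys_nlg, name='sys')

print(sys_agent.response("I want to find a moderately priced hotel."))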

Supported Datasets

  • Multiwoz 2.1
    • We add user dialogue acts (inform, request, bye, greet, thank), remove 5 sessions with incomplete dialogue act annotations, and place the data under the data/multiwoz dir.
    • Train/val/test size: 8434/999/1000. Split as original data.
    • LICENSE: Attribution 4.0 International, url: http://creativecommons.org/licenses/by/4.0/
  • CrossWOZ
    • We offer a rule-based user simulator and a complete set of models for building a pipeline system on the CrossWOZ dataset. We corrected a few state annotations and placed the data under the data/crosswoz dir.
    • Train/val/test size: 5012/500/500. Split as original data.
    • LICENSE: Attribution 4.0 International, url: http://creativecommons.org/licenses/by/4.0/
  • Camrest
    • We add system dialogue acts (inform, request, nooffer) and place the data under the data/camrest dir.
    • Train/val/test size: 406/135/135. Split as original data.
    • LICENSE: Attribution 4.0 International, url: http://creativecommons.org/licenses/by/4.0/
  • Dealornot

End-to-end Performance on MultiWOZ

Notice: the results are for commits up to and including bdc9dba. We will update them after improving the user policy.

We perform end-to-end evaluation (1000 dialogues) on MultiWOZ using the user simulator configured below (a full example is in tests/test_end2end.py):

# imports as in the tutorial; `set_seed` seeds random/numpy/torch and
# `sys_agent` is the system-side PipelineAgent under test (see tests/test_end2end.py)
from convlab2.nlu.jointBERT.multiwoz import BERTNLU
from convlab2.policy.rule.multiwoz import RulePolicy
from convlab2.nlg.template.multiwoz import TemplateNLG
from convlab2.dialog_agent import PipelineAgent
from convlab2.util.analysis_tool.analyzer import Analyzer

# BERT NLU trained on sys utterances (the user simulator must parse system turns)
user_nlu = BERTNLU(mode='sys', config_file='multiwoz_sys_context.json', model_file='https://huggingface.co/ConvLab/ConvLab-2_models/resolve/main/bert_multiwoz_sys_context.zip')
user_dst = None                            # the Agenda policy needs no tracker
user_policy = RulePolicy(character='usr')  # rule-based (Agenda) user policy
user_nlg = TemplateNLG(is_user=True)
user_agent = PipelineAgent(user_nlu, user_dst, user_policy, user_nlg, name='user')

analyzer = Analyzer(user_agent=user_agent, dataset='multiwoz')

set_seed(20200202)
analyzer.comprehensive_analyze(sys_agent=sys_agent, model_name='sys_agent', total_dialog=1000)

Main metrics (refer to convlab2/evaluator/multiwoz_eval.py for more details):

  • Complete: whether the user goal is completed, as judged by the Agenda policy rather than an external evaluator.
  • Success: whether all user requests have been informed and the booked entities satisfy the user's constraints.
  • Book: the fraction of booked entities that satisfy the user's constraints.
  • Inform Precision/Recall/F1: precision/recall/F1 over the user requests the system informed.
  • Turn(succ/all): average number of turns for successful/all dialogues.
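
For intuition, here is a toy sketch of how the inform precision/recall/F1 can be computed from slot sets (a simplified illustration with made-up slots; the actual bookkeeping lives in convlab2/evaluator/multiwoz_eval.py):

# toy (domain, slot) pairs, purely illustrative
requested = {('hotel', 'phone'), ('hotel', 'address'), ('train', 'price')}
informed = {('hotel', 'phone'), ('train', 'price'), ('train', 'duration')}

tp = len(requested & informed)  # requested slots that were actually informed
precision = tp / len(informed)  # how much of what the system said was asked for
recall = tp / len(requested)    # how many user requests were answered
f1 = 2 * precision * recall / (precision + recall)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")  # P=0.67 R=0.67 F1=0.67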

Performance (the first row is the default config for each module; a None entry means that module is not used, e.g., word-level DST models consume raw utterances and word-level policies generate responses directly):

NLU DST Policy NLG Complete rate Success rate Book rate Inform P/R/F1 Turn(succ/all)
BERTNLU RuleDST RulePolicy TemplateNLG 90.5 81.3 91.1 79.7/92.6/83.5 11.6/12.3
MILU RuleDST RulePolicy TemplateNLG 93.3 81.8 93.0 80.4/94.7/84.8 11.3/12.1
BERTNLU RuleDST RulePolicy SCLSTM 48.5 40.2 56.9 62.3/62.5/58.7 11.9/27.1
BERTNLU RuleDST MLEPolicy TemplateNLG 42.7 35.9 17.6 62.8/69.8/62.9 12.1/24.1
BERTNLU RuleDST PGPolicy TemplateNLG 37.4 31.7 17.4 57.4/63.7/56.9 11.0/25.3
BERTNLU RuleDST PPOPolicy TemplateNLG 75.5 71.7 86.6 69.4/85.8/74.1 13.1/17.8
BERTNLU RuleDST GDPLPolicy TemplateNLG 49.4 38.4 20.1 64.5/73.8/65.6 11.5/21.3
None TRADE RulePolicy TemplateNLG 32.4 20.1 34.7 46.9/48.5/44.0 11.4/23.9
None SUMBT RulePolicy TemplateNLG 34.5 29.4 62.4 54.1/50.3/48.3 11.0/28.1
BERTNLU RuleDST MDRG None 21.6 17.8 31.2 39.9/36.3/34.8 15.6/30.5
BERTNLU RuleDST LaRL None 34.8 27.0 29.6 49.1/53.6/47.8 13.2/24.4
None SUMBT LaRL None 32.9 23.7 25.9 48.6/52.0/46.7 12.5/24.3
None None DAMD* None 39.5 34.3 51.4 60.4/59.8/56.3 15.8/29.8

*: end-to-end models used as sys_agent directly.

Module Performance on MultiWOZ

NLU

By running convlab2/nlu/evaluate.py MultiWOZ $model all:

Model Precision Recall F1
BERTNLU 82.48 85.59 84.01
MILU 80.29 83.63 81.92
SVMNLU 74.96 50.74 60.52

DST

By running convlab2/dst/evaluate.py MultiWOZ $model:

Model Joint accuracy Slot accuracy Joint F1
MDBT 0.06 0.89 0.43
SUMBT 0.30 0.96 0.83
TRADE 0.40 0.96 0.84
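
As a reminder of what these columns mean, here is a toy sketch using the standard definitions (made-up states; the evaluator's exact implementation is in convlab2/dst/evaluate.py):

# joint accuracy: a turn counts only if the entire predicted state matches;
# slot accuracy: every slot is scored independently (two made-up turns)
gold = [{'hotel-area': 'north', 'hotel-stars': '4'},
        {'hotel-area': 'north', 'hotel-stars': '5'}]
pred = [{'hotel-area': 'north', 'hotel-stars': '4'},
        {'hotel-area': 'south', 'hotel-stars': '5'}]

joint_acc = sum(g == p for g, p in zip(gold, pred)) / len(gold)  # 0.5
slot_acc = (sum(g[k] == p[k] for g, p in zip(gold, pred) for k in g)
            / sum(len(g) for g in gold))                         # 0.75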

Policy

Notice: the results are for commits up to and including bdc9dba. We will update them after improving the user policy.

By running convlab2/policy/evaluate.py --model_name $model:

Policy Task Success Rate
MLE 0.56
PG 0.54
PPO 0.89
GDPL 0.58

NLG

By running convlab2/nlg/evaluate.py MultiWOZ $model sys

Model Corpus BLEU-4
Template 0.3309
SCLSTM 0.4884

Translation-train SUMBT for cross-lingual DST

Train

With ConvLab-2, you can train SUMBT on a machine-translated dataset as follows:

# train.py
from sys import argv

if __name__ == "__main__":
    if len(argv) != 2:
        print('usage: python3 train.py [dataset]')
        exit(1)
    assert argv[1] in ['multiwoz', 'crosswoz']

    # import the translation-train SUMBT variant matching the chosen dataset
    if argv[1] == 'multiwoz':
        from convlab2.dst.sumbt.multiwoz_zh.sumbt import SUMBTTracker as SUMBT
    elif argv[1] == 'crosswoz':
        from convlab2.dst.sumbt.crosswoz_en.sumbt import SUMBTTracker as SUMBT

    sumbt = SUMBT()
    sumbt.train(True)  # start training
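
For example, to translation-train SUMBT on the Chinese MultiWOZ data:

python3 train.py multiwoz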

Evaluate

Execute evaluate.py (under convlab2/dst/) with the following command:

python3 evaluate.py [CrossWOZ-en|MultiWOZ-zh] [val|test|human_val]

The evaluation results (joint accuracy) of our pre-trained models are:

Split CrossWOZ-en MultiWOZ-zh
val 12.4% 48.5%
test 12.4% 46.0%
human_val 10.6% 47.4%

The human_val option evaluates the model on the validation set translated by humans.

Note: you may want to download the pre-trained BERT models and the translation-trained SUMBT models we provide.

Without modifying any code, you could:

  • download the pre-trained BERT models and extract them to ./pre-trained-models.

  • for the translation-trained SUMBT models:

    • trained on CrossWOZ-en
    • trained on MultiWOZ-zh
    • For example, if the dataset is CrossWOZ (English), after extraction save the pre-trained model under ./convlab2/dst/sumbt/crosswoz_en/pre-trained and name it pytorch_model.bin.
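
As a quick sanity check (a hypothetical snippet, not part of the toolkit), you can verify that the weights landed where the tracker expects them:

import os

# expected location for the CrossWOZ-en translation-trained SUMBT weights
expected = './convlab2/dst/sumbt/crosswoz_en/pre-trained/pytorch_model.bin'
assert os.path.isfile(expected), f'missing model weights at {expected}'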

Issues

You are welcome to create an issue if you want to request a feature, report a bug or ask a general question.

Contributions

We welcome contributions from the community.

  • If you want to make a big change, we recommend first creating an issue with your design.
  • Small contributions can be made directly in a pull request.
  • If you would like to contribute to our library, see the issues to find out what we need.

Team

ConvLab-2 is maintained and developed by Tsinghua University Conversational AI group (THU-coai) and Microsoft Research (MSR).

We would like to thank:

Yan Fang, Zhuoer Feng, Jianfeng Gao, Qihan Guo, Kaili Huang, Minlie Huang, Sungjin Lee, Bing Li, Jinchao Li, Xiang Li, Xiujun Li, Jiexi Liu, Lingxiao Luo, Wenchang Ma, Mehrad Moradshahi, Baolin Peng, Runze Liang, Ryuichi Takanobu, Hongru Wang, Jiaxin Wen, Yaoqin Zhang, Zheng Zhang, Qi Zhu, Xiaoyan Zhu.

Citing

If you use ConvLab-2 in your research, please cite:

@inproceedings{zhu2020convlab2,
    title={ConvLab-2: An Open-Source Toolkit for Building, Evaluating, and Diagnosing Dialogue Systems},
    author={Qi Zhu and Zheng Zhang and Yan Fang and Xiang Li and Ryuichi Takanobu and Jinchao Li and Baolin Peng and Jianfeng Gao and Xiaoyan Zhu and Minlie Huang},
    year={2020},
    booktitle={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
}

@inproceedings{liu2021robustness,
    title={Robustness Testing of Language Understanding in Task-Oriented Dialog},
    author={Liu, Jiexi and Takanobu, Ryuichi and Wen, Jiaxin and Wan, Dazhen and Li, Hongguang and Nie, Weiran and Li, Cheng and Peng, Wei and Huang, Minlie},
    year={2021},
    booktitle={Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics},
}

License

Apache License 2.0
