• Stars: 280
• Rank: 147,492 (Top 3%)
• Language: Python
• License: Apache License 2.0
• Created: about 4 years ago
• Updated: over 1 year ago

Repository Details

DialoGLUE: A Natural Language Understanding Benchmark for Task-Oriented Dialogue

DialoGLUE

DialoGLUE is a conversational AI benchmark designed to encourage dialogue research in representation-based transfer, domain adaptation, and sample-efficient task learning. For a more detailed write-up of the benchmark, check out our paper.

This repository contains all code related to the benchmark, including scripts for downloading relevant datasets, preprocessing them in a consistent format for benchmark submissions, evaluating any submission outputs, and running baseline models from the original benchmark description.

This repository also contains code for our NAACL paper, Example-Driven Intent Prediction with Observers.

Datasets

This benchmark is created by scripts that pull data from previously-published data resources. Thank you to the authors of those works for their great contributions:

Dataset      | Size | Description                                                                        | License
Banking77    | 13K  | online banking queries                                                             | CC-BY-4.0
HWU64        | 11K  | popular personal assistant queries                                                 | CC-BY-SA 3.0
CLINC150     | 20K  | popular personal assistant queries                                                 | CC-BY-SA 3.0
Restaurant8k | 8.2K | restaurant booking domain queries                                                  | CC-BY-4.0
DSTC8 SGD    | 20K  | multi-domain, task-oriented conversations between a human and a virtual assistant | CC-BY-SA 4.0 International
TOP          | 45K  | compositional queries for hierarchical semantic representations                   | CC-BY-SA
MultiWOZ 2.1 | 12K  | multi-domain dialogues with multiple turns                                         | MIT

Data Download

To download and preprocess the datasets that make up the DialoGLUE benchmark, run bash download_data.sh from within the data_utils directory.
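For example, starting from the repository root:

cd data_utils
bash download_data.sh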

Upon completion, the dialoglue folder (under data_utils) should contain the following:

dialoglue/
β”œβ”€β”€ banking
β”‚   β”œβ”€β”€ categories.json
β”‚   β”œβ”€β”€ test.csv
β”‚   β”œβ”€β”€ train_10.csv
β”‚   β”œβ”€β”€ train_5.csv
β”‚   β”œβ”€β”€ train.csv
β”‚   └── val.csv
β”œβ”€β”€ clinc
β”‚   β”œβ”€β”€ categories.json
β”‚   β”œβ”€β”€ test.csv
β”‚   β”œβ”€β”€ train_10.csv
β”‚   β”œβ”€β”€ train_5.csv
β”‚   β”œβ”€β”€ train.csv
β”‚   └── val.csv
β”œβ”€β”€ dstc8_sgd
β”‚   β”œβ”€β”€ stats.csv
β”‚   β”œβ”€β”€ test.json
β”‚   β”œβ”€β”€ train_10.json
β”‚   β”œβ”€β”€ train.json
β”‚   β”œβ”€β”€ val.json
β”‚   └── vocab.txt
β”œβ”€β”€ hwu
β”‚   β”œβ”€β”€ categories.json
β”‚   β”œβ”€β”€ test.csv
β”‚   β”œβ”€β”€ train_10.csv
β”‚   β”œβ”€β”€ train_5.csv
β”‚   β”œβ”€β”€ train.csv
β”‚   └── val.csv
β”œβ”€β”€ mlm_all.txt
β”œβ”€β”€ multiwoz
β”‚   β”œβ”€β”€ MULTIWOZ2.1
β”‚   β”‚   β”œβ”€β”€ dialogue_acts.json
β”‚   β”‚   β”œβ”€β”€ README.txt
β”‚   β”‚   β”œβ”€β”€ test_dials.json
β”‚   β”‚   β”œβ”€β”€ train_dials.json
β”‚   β”‚   └── val_dials.json
β”‚   └── MULTIWOZ2.1_fewshot
β”‚       β”œβ”€β”€ dialogue_acts.json
β”‚       β”œβ”€β”€ README.txt
β”‚       β”œβ”€β”€ test_dials.json
β”‚       β”œβ”€β”€ train_dials.json
β”‚       └── val_dials.json
β”œβ”€β”€ restaurant8k
β”‚   β”œβ”€β”€ test.json
β”‚   β”œβ”€β”€ train_10.json
β”‚   β”œβ”€β”€ train.json
β”‚   β”œβ”€β”€ val.json
β”‚   └── vocab.txt
└── top
    β”œβ”€β”€ eval.txt
    β”œβ”€β”€ test.txt
    β”œβ”€β”€ train_10.txt
    β”œβ”€β”€ train.txt
    β”œβ”€β”€ vocab.intent
    └── vocab.slot

The files with a _10 suffix (e.g., banking/train_10.csv) are used for the few-shot experiments, in which models are trained on only 10% of the training data.
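As a quick sanity check on the splits, you can compare the row counts of the full and few-shot training files. This is a minimal sketch, assuming pandas is installed and that the CSVs load with default settings:

import pandas as pd  # assumption: pandas is installed

# Compare the full training split with its few-shot (10%) counterpart.
full = pd.read_csv("data_utils/dialoglue/banking/train.csv")
few = pd.read_csv("data_utils/dialoglue/banking/train_10.csv")
print(f"full: {len(full)} rows, few-shot: {len(few)} rows (~{len(few) / len(full):.0%})")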

EvalAI Leaderboard

The DialoGLUE benchmark is hosted on EvalAI and we invite submissions to our leaderboard. The submission should be a JSON file with keys corresponding to each of the seven DialoGLUE datasets:

{"banking": [*banking outputs*], "hwu": [*hwu outputs*], ..., "multiwoz": [*multiwoz outputs*]}

Given a set of seven model checkpoints, you can edit and run dump_outputs.py to generate a valid submission file. For the intent classification tasks (HWU64, Banking77, CLINC150), the outputs are a list of intent classes. For the slot filling tasks (Restaurant8k, DSTC8), the outputs are a list of spans. For the TOP dataset, the outputs are a list of (intent, slots) pairs wherein each slot is the path from the root to the leaf node. For MultiWOZ, the output follows the TripPy format and corresponds to the pred_res.test.final.json output file. We strongly recommend using the dump_outputs.py script to generate outputs.
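If you assemble the file manually rather than through dump_outputs.py, the overall shape is a single JSON object keyed by dataset. The sketch below is illustrative only: collect_outputs is a hypothetical placeholder for however you gather each dataset's predictions, and the key names other than "banking", "hwu", and "multiwoz" are guesses based on the data folder names (dump_outputs.py defines the exact keys).

import json

def collect_outputs(dataset_name):
    # Hypothetical placeholder: return the list of predictions for this dataset,
    # in the per-task format described above (intent classes, spans,
    # (intent, slots) pairs, or the TripPy-style MultiWOZ output).
    return []

# Key names other than "banking", "hwu", and "multiwoz" are assumptions
# based on the dialoglue/ folder names.
datasets = ["banking", "hwu", "clinc", "restaurant8k", "dstc8", "top", "multiwoz"]
submission = {name: collect_outputs(name) for name in datasets}

with open("submission.json", "w") as f:
    json.dump(submission, f)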

For the few-shot experimental setting, we only train models on a subset (roughly 10%) of the training data. The specific data splits are produced by running download_data.sh. To mitigate the impact of random initialization, we ask that you train 5 models for each of the few-shot tasks and submit the output of all 5 models. The scores on the leaderboard will be the average of these five runs.

The few-shot submission file format is as follows:

{"banking": [*banking outputs from model 1*, ... *banking outputs from model 5*], ...}

You may run dump_outputs_fewshot.py to generate a valid submission file given the model paths corresponding to all of the runs.
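Building on the sketch above, a few-shot submission carries outputs from all five runs under each dataset key. Whether each run's outputs are concatenated or kept as separate per-run lists is determined by dump_outputs_fewshot.py; the sketch below assumes one list of outputs per run and is only illustrative (collect_outputs_for_run is a hypothetical placeholder).

import json

def collect_outputs_for_run(dataset_name, run_index):
    # Hypothetical placeholder: predictions from the model trained in this run.
    return []

datasets = ["banking", "hwu", "clinc", "restaurant8k", "dstc8", "top", "multiwoz"]
fewshot_submission = {
    name: [collect_outputs_for_run(name, run) for run in range(1, 6)]
    for name in datasets
}

with open("fewshot_submission.json", "w") as f:
    json.dump(fewshot_submission, f)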

Training

Almost all of the models can be trained and evaluated using the run.py script. MultiWOZ is the exception; it relies on the modified open-source TripPy implementation included in this repository (under trippy/).

The commands for training/evaluating models are as follows. If you want to only run inference/evaluation, simply change --num_epochs to 0.

To train using example-driven intent prediction, add the --example flag to the training script. To use observer nodes, add the --use_observers flag.
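For example, the HWU64 command from the next section can be extended with both flags (this is simply the command below with --example and --use_observers appended):

python run.py \
        --train_data_path data_utils/dialoglue/hwu/train.csv \
        --val_data_path data_utils/dialoglue/hwu/val.csv \
        --test_data_path data_utils/dialoglue/hwu/test.csv \
        --token_vocab_path bert-base-uncased-vocab.txt \
        --train_batch_size 64 --dropout 0.1 --num_epochs 100 --learning_rate 6e-5 \
        --model_name_or_path convbert-dg --task intent --do_lowercase --max_seq_length 50 \
        --mlm_pre --mlm_during --dump_outputs --example --use_observers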

Checkpoints

The relevant convbert and convbert-dg models can be found here.

HWU64

python run.py \
        --train_data_path data_utils/dialoglue/hwu/train.csv \
        --val_data_path data_utils/dialoglue/hwu/val.csv \
        --test_data_path data_utils/dialoglue/hwu/test.csv \
        --token_vocab_path bert-base-uncased-vocab.txt \
        --train_batch_size 64 --dropout 0.1 --num_epochs 100 --learning_rate 6e-5 \
        --model_name_or_path convbert-dg --task intent --do_lowercase --max_seq_length 50 --mlm_pre --mlm_during --dump_outputs

Banking77

python run.py \
        --train_data_path data_utils/dialoglue/banking/train.csv \
        --val_data_path data_utils/dialoglue/banking/val.csv \
        --test_data_path data_utils/dialoglue/banking/test.csv \
        --token_vocab_path bert-base-uncased-vocab.txt \
        --train_batch_size 32 --grad_accum 2 --dropout 0.1 --num_epochs 100 --learning_rate 6e-5 \
        --model_name_or_path convbert-dg --task intent --do_lowercase --max_seq_length 100 --mlm_pre --mlm_during --dump_outputs

CLINC150

python run.py \
        --train_data_path data_utils/dialoglue/clinc/train.csv \
        --val_data_path data_utils/dialoglue/clinc/val.csv \
        --test_data_path data_utils/dialoglue/clinc/test.csv \
        --token_vocab_path bert-base-uncased-vocab.txt \
        --train_batch_size 64 --dropout 0.1 --num_epochs 100 --learning_rate 6e-5 \
        --model_name_or_path convbert-dg --task intent --do_lowercase --max_seq_length 50 --mlm_pre --mlm_during --dump_outputs

Restaurant8k

python run.py \
        --train_data_path data_utils/dialoglue/restaurant8k/train.json \
        --val_data_path data_utils/dialoglue/restaurant8k/val.json \
        --test_data_path data_utils/dialoglue/restaurant8k/test.json \
        --token_vocab_path bert-base-uncased-vocab.txt \
        --train_batch_size 64 --dropout 0.1 --num_epochs 100 --learning_rate 6e-5 \
        --model_name_or_path convbert-dg --task slot --do_lowercase --max_seq_length 50 --mlm_pre --mlm_during --dump_outputs

DSTC8

python run.py \
        --train_data_path data_utils/dialoglue/dstc8_sgd/train.json \
        --val_data_path data_utils/dialoglue/dstc8_sgd/val.json \
        --test_data_path data_utils/dialoglue/dstc8_sgd/test.json \
        --token_vocab_path bert-base-uncased-vocab.txt \
        --train_batch_size 64 --dropout 0.1 --num_epochs 100 --learning_rate 6e-5 \
        --model_name_or_path convbert-dg --task slot --do_lowercase --max_seq_length 50 --mlm_pre --mlm_during --dump_outputs

TOP

python run.py \
        --train_data_path data_utils/dialoglue/top/train.txt \
        --val_data_path data_utils/dialoglue/top/eval.txt \
        --test_data_path data_utils/dialoglue/top/test.txt \
        --token_vocab_path bert-base-uncased-vocab.txt \
        --train_batch_size 64 --dropout 0.1 --num_epochs 100 --learning_rate 6e-5 \
        --model_name_or_path convbert-dg --task top --do_lowercase --max_seq_length 50 --mlm_pre --mlm_during --dump_outputs

MultiWOZ

The MultiWOZ code builds on the open-sourced TripPy implementation. To train/evaluate the model using our modifications (i.e., MLM pre-training), you can use trippy/DO.example.advanced.

Checkpoints

Checkpoints are released for (1) ConvBERT, (2) BERT-DG and (3) ConvBERT-DG. Given these pre-trained models and the code in this repo, all of our results can be reproduced.

Requirements

This project has been tested and is functional with Python 3.7. The Python dependencies are listed in requirements.txt.
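They can be installed in the usual way (assuming you are working inside a Python 3.7 environment):

pip install -r requirements.txt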

License

This project is licensed under the Apache-2.0 License.

Citation

If using these scripts or the DialoGLUE benchmark, please cite the following in any relevant work:

@article{MehriDialoGLUE2020,
  title={DialoGLUE: A Natural Language Understanding Benchmark for Task-Oriented Dialogue},
  author={S. Mehri and M. Eric and D. Hakkani-Tur},
  journal={ArXiv},
  year={2020},
  volume={abs/2009.13570}
}

If you use any code pertaining to example-driven training or observers, please cite the following paper:

@article{mehri2020example,
  title={Example-Driven Intent Prediction with Observers},
  author={Mehri, Shikib and Eric, Mihail and Hakkani-Tur, Dilek},
  journal={arXiv preprint arXiv:2010.08684},
  year={2020}
}

More Repositories

#  | Repository                               | Description                                                                                                                                                                                                                                                                                               | Language   | Stars
1  | alexa-skills-kit-sdk-for-nodejs          | The Alexa Skills Kit SDK for Node.js helps you get a skill up and running quickly, letting you focus on skill logic instead of boilerplate code.                                                                                                                                                          | TypeScript | 3,119
2  | alexa-cookbook                           | A series of sample code projects to be used for educational purposes during Alexa hackathons and workshops, and as a reference for tutorials and blog posts.                                                                                                                                              | JavaScript | 1,845
3  | avs-device-sdk                           | An SDK for commercial device makers to integrate Alexa directly into connected products.                                                                                                                                                                                                                  | C++        | 1,255
4  | alexa-skills-kit-sdk-for-java            | The Alexa Skills Kit SDK for Java helps you get a skill up and running quickly, letting you focus on skill logic instead of boilerplate code.                                                                                                                                                             | Java       | 817
5  | alexa-skills-kit-sdk-for-python          | The Alexa Skills Kit SDK for Python helps you get a skill up and running quickly, letting you focus on skill logic instead of boilerplate code.                                                                                                                                                           | Python     | 811
6  | Topical-Chat                             | A dataset containing human-human knowledge-grounded open-domain conversations.                                                                                                                                                                                                                            | Python     | 628
7  | massive                                  | Tools and modeling code for the MASSIVE dataset.                                                                                                                                                                                                                                                          | Python     | 538
8  | bort                                     | Repository for the paper "Optimal Subarchitecture Extraction for BERT".                                                                                                                                                                                                                                   | Python     | 470
9  | alexa-auto-sdk                           | The Alexa Auto SDK is for automotive OEMs to integrate Alexa directly into vehicles.                                                                                                                                                                                                                      | C++        | 293
10 | ask-cli                                  | Alexa Skills Kit Command Line Interface.                                                                                                                                                                                                                                                                  | JavaScript | 164
11 | teach                                    | TEACh is a dataset of human-human interactive dialogues to complete tasks in a simulated household environment.                                                                                                                                                                                           | Python     | 135
12 | alexa-apis-for-python                    | The Alexa APIs for Python consist of Python classes that represent the request and response JSON of Alexa services. These models act as a core dependency for the Alexa Skills Kit Python SDK (https://github.com/alexa/alexa-skills-kit-sdk-for-python).                                                 | Python     | 121
13 | ask-toolkit-for-vscode                   | ASK Toolkit is an extension for Visual Studio Code (VSC) that makes it easier for developers to develop and deploy Alexa Skills.                                                                                                                                                                          | TypeScript | 108
14 | alexa-with-dstc9-track1-dataset          | DSTC9 Track 1 - Beyond Domain APIs: Task-oriented Conversational Modeling with Unstructured Knowledge Access.                                                                                                                                                                                             | Python     | 105
15 | alexa-dataset-contextual-query-rewrite   | This repo includes extensions to the Stanford Dialogue Corpus. It contains crowd-sourced rewrites to facilitate research in dialogue state tracking using natural language as the interface.                                                                                                              |            | 88
16 | Commonsense-Dialogues                    | A crowdsourced dataset of dialogues grounded in social contexts involving utilization of commonsense.                                                                                                                                                                                                     |            | 79
17 | alexa-smart-screen-sdk                   | ⛔️ DEPRECATED; now active at https://github.com/alexa/avs-device-sdk                                                                                                                                                                                                                                      |            | 76
18 | alexa-with-dstc10-track2-dataset         | DSTC10 Track 2 - Knowledge-grounded Task-oriented Dialogue Modeling on Spoken Conversations.                                                                                                                                                                                                              | Python     | 61
19 | alexa-apis-for-nodejs                    | The Alexa APIs for NodeJS consist of JS and TypeScript definitions that represent the request and response JSON of Alexa services. These models act as a core dependency for the Alexa Skills Kit NodeJS SDK (https://github.com/alexa/alexa-skills-kit-sdk-for-nodejs).                                  | TypeScript | 60
20 | alexa-for-business                       | This repository holds sample Alexa skill templates for use in enterprise scenarios, in particular for use with Alexa for Business (aws.amazon.com/a4b). Some samples are more complete, such as the Help Desk skill, while others are smaller in scope, focusing on specific use cases or integrations.    | JavaScript | 45
21 | dstc11-track5                            | DSTC11 Track 5 - Task-oriented Conversational Modeling with Subjective Knowledge.                                                                                                                                                                                                                         | Python     | 45
22 | apl-core-library                         | APL Core Library enables device makers to create their own "APL viewhost", bringing Alexa experiences with visual renderings to new devices or platforms using any programming language that can invoke C/C++ code.                                                                                       | C++        | 37
23 | ask-sdk-controls                         | The ASK SDK Controls framework builds on the ASK SDK for Node.js, offering a scalable solution for creating large, multi-turn skills in code with reusable components called controls.                                                                                                                    | TypeScript | 36
24 | dstqa                                    | Code for Li Zhou, Kevin Small. Multi-domain Dialogue State Tracking as Dynamic Knowledge Graph Enhanced Question Answering. In NeurIPS 2019 Workshop on Conversational AI.                                                                                                                                | Python     | 32
25 | alexa-apis-for-java                      | The Alexa APIs for Java consist of Java POJO classes that represent the request and response JSON of Alexa services. These models act as a core dependency for the Alexa Skills Kit Java SDK (https://github.com/alexa/alexa-skills-kit-sdk-for-java).                                                    | Java       | 30
26 | kilm                                     |                                                                                                                                                                                                                                                                                                           | Python     | 23
27 | apl-viewhost-web                         |                                                                                                                                                                                                                                                                                                           | TypeScript | 23
28 | alexa-end-to-end-slu                     | This setup allows training of end-to-end neural models for spoken language understanding (SLU).                                                                                                                                                                                                           | Python     | 22
29 | AIAClientSDK                             | Device SDK for products that use Alexa Voice Service (AVS) Integration for AWS IoT, written in C99. For more information, visit https://docs.aws.amazon.com/iot/latest/developerguide/avs-integration-aws-iot.html                                                                                        | C          | 19
30 | ramen                                    | Software for transferring pre-trained English models to foreign languages.                                                                                                                                                                                                                                | Python     | 18
31 | schema-guided-nlg                        | This repository provides the dataset used in "Schema-Guided Natural Language Generation" by Yuheng Du, Shereen Oraby, Vittorio Perera, Minmin Shen, Anjali Narayan-Chen, Tagyoung Chung, Anu Venkatesh, and Dilek Hakkani-Tur.                                                                            |            | 12
32 | max-toolkit                              | The MAX Toolkit provides software which aims to accelerate the development of devices which integrate multiple voice agents. The Toolkit provides guidance to both device makers and agent developers towards this goal.                                                                                  | C++        | 12
33 | apl-suggester                            |                                                                                                                                                                                                                                                                                                           | TypeScript | 11
34 | places                                   | Code for the paper "PLACES: Prompting Language Models for Social Conversation Synthesis".                                                                                                                                                                                                                 | Python     | 11
35 | apl-viewhost-android                     |                                                                                                                                                                                                                                                                                                           | Java       | 11
36 | xlgen-eacl-2023                          |                                                                                                                                                                                                                                                                                                           | Python     | 11
37 | factual-consistency-analysis-of-dialogs  | A human-annotated dataset that determines whether neural generated responses are factually inconsistent with a knowledge snippet.                                                                                                                                                                         |            | 11
38 | apl-client-library                       |                                                                                                                                                                                                                                                                                                           | C++        | 10
39 | skill-components                         | Public repository for Alexa Conversations Description Language (ACDL) reusable components.                                                                                                                                                                                                                | TypeScript | 10
40 | visitron                                 | VISITRON: A multi-modal Transformer-based model for Cooperative Vision-and-Dialog Navigation (CVDN).                                                                                                                                                                                                      | Python     | 10
41 | gravl-bert                               | PyTorch implementation for the GraVL-BERT paper.                                                                                                                                                                                                                                                          | Python     | 9
42 | wow-plus-plus                            | WOW++ is a knowledge-grounded dataset containing multiple relevant knowledge sentences for the last turn within a dialog.                                                                                                                                                                                 |            | 8
43 | alexa-point-of-view-dataset              | Point of View (POV) conversion dataset. Messages spoken to virtual assistants are converted from the sender's perspective to the virtual assistant's perspective for delivery.                                                                                                                            | HTML       | 8
44 | alexa-dataset-redtab                     |                                                                                                                                                                                                                                                                                                           |            | 7
45 | unreliable-news-detection-biases         |                                                                                                                                                                                                                                                                                                           | Python     | 6
46 | amazon-pay-alexa-utils-for-nodejs        |                                                                                                                                                                                                                                                                                                           | TypeScript | 6
47 | conture                                  | ConTurE is a human-chatbot dataset that contains turn-level annotations to assess the quality of chatbot responses.                                                                                                                                                                                       |            | 5
48 | alexa-smart-screen-web-components        | A Node.js framework for commercial smart screen device makers to integrate Alexa multi-modal features into their products.                                                                                                                                                                                | TypeScript | 5
49 | amazon-voice-conversion-voicy            | This repository contains audio samples from the paper "Voicy: Zero-Shot Non-Parallel Voice Conversion in Noisy Reverberant Environments".                                                                                                                                                                 | HTML       | 5
50 | apl-translator-lottie                    |                                                                                                                                                                                                                                                                                                           | TypeScript | 4
51 | alexa-conversations-reusable-dialogs     |                                                                                                                                                                                                                                                                                                           |            | 4
52 | alexa-with-dstc9-track1-new-model        |                                                                                                                                                                                                                                                                                                           | Python     | 3
53 | avs-sdk-oobe-screens-demo                | Demo of the Alexa Voice Service OOBE flow for screen-based devices. To be used with the AVS Smart Screen SDK.                                                                                                                                                                                              | JavaScript | 2
54 | dial-guide                               |                                                                                                                                                                                                                                                                                                           |            | 2