
Repository Details

A dataset containing human-human knowledge-grounded open-domain conversations.

Topical-Chat

We introduce Topical-Chat, a knowledge-grounded human-human conversation dataset where the underlying knowledge spans 8 broad topics and conversation partners don’t have explicitly defined roles.

Topical-Chat broadly consists of two types of files:

  • Conversations: JSON files containing conversations between pairs of Amazon Mechanical Turk workers.
  • Reading Sets: JSON files containing knowledge sections rendered as reading content to the Turkers having conversations.

We provide a simple script, build.py, to build the reading sets for the dataset, by making API calls to the relevant sources of the data.

For detailed information about the dataset, modeling benchmarking experiments and evaluation results, please refer to our paper.

Prerequisites

After cloning this repo, please run the following commands, preferably after creating a virtual environment:

pip install -r src/requirements.txt
mkdir reading_sets/post-build/

Please create your own Reddit API credentials, and manually add them to src/reddit/prawler.py

The scripts in this repo have been tested with Python 3.7, which is the version we recommend.

Build

Run python build.py after adding your own Reddit credentials to src/reddit/prawler.py and creating the reading_sets/post-build/ directory.

build.py reads each file in reading_sets/pre-build/ and writes a JSON file with the exact same name, now including the actual reading sets, to reading_sets/post-build/.
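Because each pre-build file should end up with a same-named file in the post-build directory, a quick check after the build can confirm nothing was skipped. The following is a minimal sketch, not part of the repo: the helper name missing_post_build_files is hypothetical, and it assumes the files in both directories carry a .json extension.

```python
from pathlib import Path


def missing_post_build_files(pre_dir, post_dir):
    """Return names of pre-build JSON files that have no same-named post-build file."""
    # Assumption: reading-set files in both directories end in ".json".
    pre_names = {p.name for p in Path(pre_dir).glob("*.json")}
    post_names = {p.name for p in Path(post_dir).glob("*.json")}
    return sorted(pre_names - post_names)
```

Calling missing_post_build_files("reading_sets/pre-build", "reading_sets/post-build") from the repo root should return an empty list once the build has completed for every file.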

Dataset

Statistics:

Stat                                  Train    Valid Freq.  Valid Rare  Test Freq.  Test Rare  All
# of conversations                    8628     539          539         539         539        10784
# of utterances                       188378   11681        11692       11760       11770      235281
average # of turns per conversation   21.8     21.6         21.7        21.8        21.8       21.8
average length of utterance           19.5     19.8         19.8        19.5        19.5       19.6

Split:

The data is split into 5 distinct groups: train, valid frequent, valid rare, test frequent and test rare. The frequent set contains entities frequently seen in the training set. The rare set contains entities that were infrequently seen in the training set.

Configuration Type:

For each conversation to be collected, we applied a random knowledge configuration from a pre-defined list of configurations, to construct a pair of reading sets to be rendered to the partnered Turkers. Configurations were defined to impose varying degrees of knowledge symmetry or asymmetry between partner Turkers, leading to the collection of a wide variety of conversations.

Reading sets for Turkers 1 and 2 in Config A

Reading sets for Turkers 1 and 2 in Config B

Reading sets for Turkers 1 and 2 in Config C&D

Conversations:

Each JSON file in conversations/ has the following format:

{
  <conversation_id>: {
    "article_url": <article url>,
    "config": <config>,  # one of A, B, C, D
    "content": [  # ordered list of conversation turns
      {
        "agent": "agent_1",  # or "agent_2"
        "message": <message text>,
        "sentiment": <text>,
        "knowledge_source": ["AS1", "Personal Knowledge", ...],
        "turn_rating": "Poor"
      },
      …
    ],
    "conversation_rating": {
      "agent_1": "Good",
      "agent_2": "Excellent"
    }
  },
  …
}
  • conversation_id: A unique identifier for a conversation in Topical-Chat
  • article_url: URL pointing to the Washington Post article associated with a conversation
  • config: The knowledge configuration applied to obtain a pair of reading sets for a conversation
  • content: An ordered list of conversation turns
    • agent: An identifier for the Turker who generated the message
    • message: The message generated by the agent
    • sentiment: Self-annotation of the sentiment of the message
    • knowledge_source: Self-annotation of the section within the agent's reading set used to generate this message
    • turn_rating: Partner-annotation of the quality of the message
  • conversation_rating: Self-annotation of the quality of the conversation
    • agent_1: Rating of the conversation by Turker 1
    • agent_2: Rating of the conversation by Turker 2
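
The fields above are enough to recompute simple per-file statistics. The sketch below is illustrative rather than the authors' code: summarize_conversations is a hypothetical helper, utterance length is measured with whitespace tokenization (an assumption, so the numbers need not match the statistics reported for the dataset), and the example conversation is made up to match the documented shape.

```python
from collections import Counter


def summarize_conversations(conversations):
    """Compute simple statistics over a dict shaped like a conversations/ file."""
    n_conversations = len(conversations)
    n_utterances = 0
    n_tokens = 0
    source_counts = Counter()
    for conversation in conversations.values():
        for turn in conversation["content"]:
            n_utterances += 1
            # Assumption: whitespace tokenization; the paper's tokenizer may differ.
            n_tokens += len(turn["message"].split())
            source_counts.update(turn["knowledge_source"])
    return {
        "conversations": n_conversations,
        "utterances": n_utterances,
        "avg_turns": n_utterances / n_conversations if n_conversations else 0.0,
        "avg_utterance_length": n_tokens / n_utterances if n_utterances else 0.0,
        "knowledge_sources": dict(source_counts),
    }


# Made-up data in the documented shape (not taken from the dataset).
example = {
    "conv_0001": {
        "article_url": "https://example.com/article",
        "config": "A",
        "content": [
            {"agent": "agent_1", "message": "Did you know cats sleep a lot?",
             "sentiment": "Happy", "knowledge_source": ["FS1"],
             "turn_rating": "Good"},
            {"agent": "agent_2", "message": "I did not!",
             "sentiment": "Surprised", "knowledge_source": ["Personal Knowledge"],
             "turn_rating": "Good"},
        ],
        "conversation_rating": {"agent_1": "Good", "agent_2": "Excellent"},
    }
}

summary = summarize_conversations(example)
```

For a real file, load it with json.load and pass the resulting dict to summarize_conversations.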

Reading Sets:

Each JSON file in reading_sets/post-build/ has the following format:

{
  <conversation_id>: {
    "config": <config>,
    "agent_1": {
      "FS1": {
        "entity": <entity name>,
        "shortened_wiki_lead_section": <section text>,
        "fun_facts": [<fact1_text>, <fact2_text>, …]
      },
      "FS2": {…},
      …
    },
    "agent_2": {
      "FS1": {
        "entity": <entity name>,
        "shortened_wiki_lead_section": <section text>,
        "fun_facts": [<fact1_text>, <fact2_text>, …]
      },
      "FS2": {…},
      …
    },
    "article": {
      "url": <url>,
      "headline": <headline text>,
      "AS1": <section 1 text>,
      "AS2": <section 2 text>,
      "AS3": <section 3 text>,
      "AS4": <section 4 text>
    }
  },
  …
}
  • conversation_id: A unique identifier for a conversation in Topical-Chat
  • config: The knowledge configuration applied to obtain a pair of reading sets for a conversation
  • agent_{1/2}: Contains the factual sections in this agent's reading set
    • FS{1/2/3}: Identifier for a factual section
      • entity: A real-world entity
      • shortened_wiki_lead_section: A shortened version of the Wikipedia lead section of the entity
      • summarized_wiki_lead_section: A (TextRank) summarized version of the Wikipedia lead section of the entity
      • fun_facts: Crowdsourced and manually curated fun facts about the entity from Reddit's r/todayilearned subreddit
  • article: A Washington Post article common to both partners' reading sets
    • url: URL pointing to the Washington Post article associated with a conversation
    • headline: The headline of the Washington Post article
    • AS{1/2/3/4}: A chunk of the body of the Washington Post article
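
A turn's knowledge_source labels can be traced back to the content in a reading set. The sketch below assumes that labels such as "FS1" and "AS1" match the reading-set keys of the same name, which the two formats suggest but which is not stated explicitly. resolve_knowledge is a hypothetical helper, not something shipped in the repo.

```python
def resolve_knowledge(reading_set, agent, label):
    """Return the reading-set content a knowledge_source label points to, or None."""
    if label.startswith("FS"):
        # Factual sections are stored separately for each agent.
        return reading_set.get(agent, {}).get(label)
    if label.startswith("AS"):
        # Article sections are common to both partners' reading sets.
        return reading_set.get("article", {}).get(label)
    # Labels such as "Personal Knowledge" have no entry in the reading set.
    return None
```

Here reading_set is the value stored under one conversation_id in a post-build file, and agent is "agent_1" or "agent_2", taken from the turn being inspected.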

Wikipedia Data:

src/wiki/wiki.json has the following format:

{
  "shortened_wiki_lead_section": {
    <shortened wiki lead section text>: <unique_identifier>,
    <shortened wiki lead section text>: <unique_identifier>
  },
  "summarized_wiki_lead_section": {
    <summarized wiki lead section text>: <unique_identifier>,
    <summarized wiki lead section text>: <unique_identifier>
  }
}

build.py puts data from wiki.json into the relevant reading sets.
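
Since wiki.json maps section text to a unique identifier, looking up text by identifier requires inverting each mapping. The sketch below shows only that inversion. How build.py actually uses the identifiers is not described here, and invert_wiki_index is a hypothetical name.

```python
def invert_wiki_index(wiki):
    """Map each unique identifier back to its text, separately per section type."""
    # Assumption: identifiers are hashable values (for example strings or integers).
    return {
        section_type: {identifier: text for text, identifier in text_to_id.items()}
        for section_type, text_to_id in wiki.items()
    }
```

After loading src/wiki/wiki.json with json.load, invert_wiki_index(wiki)["shortened_wiki_lead_section"] would give an identifier-to-text dictionary for the shortened lead sections.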

Citation

If you use Topical-Chat in your work, please cite with the following:

@inproceedings{gopalakrishnan2019topical,
  author={Karthik Gopalakrishnan and Behnam Hedayatnia and Qinlang Chen and Anna Gottardi and Sanjeev Kwatra and Anu Venkatesh and Raefer Gabriel and Dilek Hakkani-Tür},
  title={{Topical-Chat: Towards Knowledge-Grounded Open-Domain Conversations}},
  year=2019,
  booktitle={Proc. Interspeech 2019},
  pages={1891--1895},
  doi={10.21437/Interspeech.2019-3079},
  url={http://dx.doi.org/10.21437/Interspeech.2019-3079}
}

Gopalakrishnan, Karthik, et al. "Topical-Chat: Towards Knowledge-Grounded Open-Domain Conversations." Proc. Interspeech 2019.

Acknowledgements

We thank Anju Khatri, Anjali Chadha and Mohammad Shami for their help with the public release of the dataset. We thank Jeff Nunn and Yi Pan for their early contributions to the dataset collection.
