Repository Details

A dataset containing human-human knowledge-grounded open-domain conversations.

Topical-Chat

We introduce Topical-Chat, a knowledge-grounded human-human conversation dataset where the underlying knowledge spans 8 broad topics and conversation partners don’t have explicitly defined roles.

Topical-Chat broadly consists of two types of files:

  • Conversations: JSON files containing conversations between pairs of Amazon Mechanical Turk workers.
  • Reading Sets: JSON files containing knowledge sections rendered as reading content to the Turkers having conversations.

We provide a simple script, build.py, to build the reading sets for the dataset, by making API calls to the relevant sources of the data.

For detailed information about the dataset, modeling benchmarking experiments and evaluation results, please refer to our paper.

Prerequisites

After cloning this repo, please run the following commands, preferably after creating a virtual environment:

pip install -r src/requirements.txt
mkdir reading_sets/post-build/

Please create your own Reddit API credentials and manually add them to src/reddit/prawler.py.

The scripts in this repo have been tested with Python 3.7, which is the version we recommend.

Build

Run python build.py after you have manually added your own Reddit credentials to src/reddit/prawler.py and created the reading_sets/post-build/ directory.

build.py reads each file in reading_sets/pre-build/ and writes a replica JSON with the same name, now including the actual reading sets, to reading_sets/post-build/.
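
The file mapping between the two directories can be sketched as follows. This is only an illustration: plan_build and its root argument are hypothetical names, the helper never calls build.py or any API, and it assumes the pre-build files use a .json extension.

```python
from pathlib import Path


def plan_build(root):
    """Return (source, target) path pairs mirroring pre-build into post-build.

    Hypothetical helper: it only mirrors file names and never fetches data.
    """
    pre_build = Path(root) / "reading_sets" / "pre-build"
    post_build = Path(root) / "reading_sets" / "post-build"
    # Same effect as `mkdir reading_sets/post-build/`, but safe to re-run.
    post_build.mkdir(parents=True, exist_ok=True)
    # Each target keeps the exact file name of its source.
    return [(src, post_build / src.name) for src in sorted(pre_build.glob("*.json"))]
```

Calling plan_build(".") from the repository root would create the output directory if it is missing and return one pair per pre-build file, which can be handy for checking that a build produced every expected file.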

Dataset

Statistics:

Stat                                  Train    Valid Freq.  Valid Rare  Test Freq.  Test Rare  All
# of conversations                    8628     539          539         539         539        10784
# of utterances                       188378   11681        11692       11760       11770      235281
average # of turns per conversation   21.8     21.6         21.7        21.8        21.8       21.8
average length of utterance           19.5     19.8         19.8        19.5        19.5       19.6

Split:

The data is split into 5 distinct groups: train, valid frequent, valid rare, test frequent and test rare. The frequent set contains entities frequently seen in the training set. The rare set contains entities that were infrequently seen in the training set.

Configuration Type:

For each conversation to be collected, we applied a random knowledge configuration from a pre-defined list of configurations, to construct a pair of reading sets to be rendered to the partnered Turkers. Configurations were defined to impose varying degrees of knowledge symmetry or asymmetry between partner Turkers, leading to the collection of a wide variety of conversations.

Figure: Reading sets for Turkers 1 and 2 in Config A

Figure: Reading sets for Turkers 1 and 2 in Config B

Figure: Reading sets for Turkers 1 and 2 in Config C&D

Conversations:

Each JSON file in conversations/ has the following format:

{
<conversation_id>: {
    "article_url": <article url>,
    "config": <config>, # one of A, B, C, D
    "content": [ # ordered list of conversation turns
        {
            "agent": "agent_1", # or "agent_2"
            "message": <message text>,
            "sentiment": <text>,
            "knowledge_source": ["AS1", "Personal Knowledge", ...],
            "turn_rating": "Poor"
        },
        ...
    ],
    "conversation_rating": {
        "agent_1": "Good",
        "agent_2": "Excellent"
    }
},
...
}

  • conversation_id: A unique identifier for a conversation in Topical-Chat
  • article_url: URL pointing to the Washington Post article associated with a conversation
  • config: The knowledge configuration applied to obtain a pair of reading sets for a conversation
  • content: An ordered list of conversation turns
    • agent: An identifier for the Turker who generated the message
    • message: The message generated by the agent
    • sentiment: Self-annotation of the sentiment of the message
    • knowledge_source: Self-annotation of the section within the agent's reading set used to generate this message
    • turn_rating: Partner-annotation of the quality of the message
  • conversation_rating: Self-annotation of the quality of the conversation
    • agent_1: Rating of the conversation by Turker 1
    • agent_2: Rating of the conversation by Turker 2
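
As a minimal sketch of working with this format, the function below computes corpus-level counts similar to the rows of the statistics table. The name conversation_stats is hypothetical, the sample values are made up, and counting utterance length in whitespace-separated tokens is an assumption, so the numbers it produces may not match the published statistics exactly.

```python
def conversation_stats(conversations):
    """Compute corpus-level counts for a dict in the conversations/ format."""
    n_conversations = len(conversations)
    # Flatten every turn of every conversation into one list.
    turns = [turn for conv in conversations.values() for turn in conv["content"]]
    n_utterances = len(turns)
    # Assumption: utterance length is counted in whitespace-separated tokens.
    n_tokens = sum(len(turn["message"].split()) for turn in turns)
    return {
        "conversations": n_conversations,
        "utterances": n_utterances,
        "avg_turns": n_utterances / n_conversations if n_conversations else 0.0,
        "avg_length": n_tokens / n_utterances if n_utterances else 0.0,
    }


# Made-up sample that follows the documented schema.
sample = {
    "example_conversation_id": {
        "article_url": "<article url>",
        "config": "A",
        "content": [
            {
                "agent": "agent_1",
                "message": "Do you like jazz?",
                "sentiment": "Neutral",
                "knowledge_source": ["FS1"],
                "turn_rating": "Good",
            },
            {
                "agent": "agent_2",
                "message": "Yes, I listen to it often.",
                "sentiment": "Neutral",
                "knowledge_source": ["Personal Knowledge"],
                "turn_rating": "Good",
            },
        ],
        "conversation_rating": {"agent_1": "Good", "agent_2": "Excellent"},
    }
}

stats = conversation_stats(sample)
```

To run this on real data, pass the dictionary obtained by loading one of the files in conversations/ with json.load.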

Reading Sets:

Each JSON file in reading_sets/post-build/ has the following format:

{
<conversation_id>: {
    "config": <config>,
    "agent_1": {
        "FS1": {
            "entity": <entity name>,
            "shortened_wiki_lead_section": <section text>,
            "fun_facts": [<fact1_text>, <fact2_text>, ...]
        },
        "FS2": {...},
        ...
    },
    "agent_2": {
        "FS1": {
            "entity": <entity name>,
            "shortened_wiki_lead_section": <section text>,
            "fun_facts": [<fact1_text>, <fact2_text>, ...]
        },
        "FS2": {...},
        ...
    },
    "article": {
        "url": <url>,
        "headline": <headline text>,
        "AS1": <section 1 text>,
        "AS2": <section 2 text>,
        "AS3": <section 3 text>,
        "AS4": <section 4 text>
    }
},
...
}

  • conversation_id: A unique identifier for a conversation in Topical-Chat
  • config: The knowledge configuration applied to obtain a pair of reading sets for a conversation
  • agent_{1/2}: Contains the factual sections in this agent's reading set
    • FS{1/2/3}: Identifier for a factual section
      • entity: A real-world entity
      • shortened_wiki_lead_section: A shortened version of the Wikipedia lead section of the entity
      • summarized_wiki_lead_section: A (TextRank) summarized version of the Wikipedia lead section of the entity
      • fun_facts: Crowdsourced and manually curated fun facts about the entity from Reddit's r/todayilearned subreddit
  • article: A Washington Post article common to both partners' reading sets
    • url: URL pointing to the Washington Post article associated with a conversation
    • headline: The headline of the Washington Post article
    • AS{1/2/3/4}: A chunk of the body of the Washington Post article
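
The reading-set layout can be connected to the knowledge_source annotations of a conversation turn roughly as sketched below. The function resolve_knowledge is a hypothetical helper, and it assumes that the tags in knowledge_source ("FS1", "AS1", and so on) match the keys shown above; values such as "Personal Knowledge" have no entry in the reading set and resolve to None.

```python
def resolve_knowledge(reading_set, agent, tag):
    """Return the reading-set entry for a knowledge_source tag, or None."""
    if tag.startswith("FS"):
        # Factual sections are stored per agent.
        return reading_set.get(agent, {}).get(tag)
    if tag.startswith("AS"):
        # Article sections are shared by both agents.
        return reading_set.get("article", {}).get(tag)
    # Anything else (for example "Personal Knowledge") is not in the reading set.
    return None


# Made-up reading set for a single conversation, following the documented schema.
reading_set = {
    "config": "A",
    "agent_1": {
        "FS1": {
            "entity": "Jazz",
            "shortened_wiki_lead_section": "Jazz is a music genre.",
            "fun_facts": ["A made-up fun fact about jazz."],
        }
    },
    "agent_2": {},
    "article": {
        "url": "<url>",
        "headline": "<headline text>",
        "AS1": "First chunk of the article body.",
    },
}

section = resolve_knowledge(reading_set, "agent_1", "FS1")
```

Here reading_set is the value stored under one conversation_id, and agent is the "agent" field of the turn being resolved.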

Wikipedia Data:

src/wiki/wiki.json has the following format:

{
  "shortened_wiki_lead_section": {
    <shortened wiki lead section text>: <unique_identifier>,
    <shortened wiki lead section text>: <unique_identifier>
  },
  "summarized_wiki_lead_section": {
    <summarized wiki lead section text>: <unique_identifier>,
    <summarized wiki lead section text>: <unique_identifier>
  }
}

build.py puts data from wiki.json into the relevant reading sets.
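
Because wiki.json maps section text to identifiers, looking up text by identifier requires inverting each mapping. The sketch below shows one way to do that; invert_wiki_index is a hypothetical name, the sample entries are made up, and how build.py itself performs the substitution is not described here.

```python
def invert_wiki_index(wiki):
    """Map each unique identifier back to its text, per section type."""
    return {
        section: {identifier: text for text, identifier in entries.items()}
        for section, entries in wiki.items()
    }


# Made-up content in the documented wiki.json layout.
wiki = {
    "shortened_wiki_lead_section": {"Jazz is a music genre.": "id_1"},
    "summarized_wiki_lead_section": {"Jazz is music.": "id_1"},
}

by_id = invert_wiki_index(wiki)
```

If two texts ever shared an identifier within one section type, the later one would overwrite the earlier one in the inverted mapping, so this sketch relies on identifiers being unique as the format states.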

Citation

If you use Topical-Chat in your work, please cite with the following:

@inproceedings{gopalakrishnan2019topical,
  author={Karthik Gopalakrishnan and Behnam Hedayatnia and Qinlang Chen and Anna Gottardi and Sanjeev Kwatra and Anu Venkatesh and Raefer Gabriel and Dilek Hakkani-Tür},
  title={{Topical-Chat: Towards Knowledge-Grounded Open-Domain Conversations}},
  year=2019,
  booktitle={Proc. Interspeech 2019},
  pages={1891--1895},
  doi={10.21437/Interspeech.2019-3079},
  url={http://dx.doi.org/10.21437/Interspeech.2019-3079}
}

Gopalakrishnan, Karthik, et al. "Topical-Chat: Towards Knowledge-Grounded Open-Domain Conversations." Proc. Interspeech 2019.

Acknowledgements

We thank Anju Khatri, Anjali Chadha and Mohammad Shami for their help with the public release of the dataset. We thank Jeff Nunn and Yi Pan for their early contributions to the dataset collection.
