• Stars: 138
• Rank: 264,508 (Top 6 %)
• Language: Python
• License: Other
• Created: over 6 years ago
• Updated: over 1 year ago

Repository Details

Mining individual characters in multiparty dialogue

Character Mining

The Character Mining project challenges machine comprehension on multiparty dialogue. The objective of this project is to infer explicit and implicit contexts about individual characters through their conversations. This is an open-source project led by the Emory NLP research group that provides resources for the following tasks:

We welcome feedback and contributions from the community. Most of our annotations are crowdsourced, so errors are to be expected. Please submit a pull request if you wish to fix errors in our datasets.

Dataset

Our dataset is based on the popular TV show called Friends. Transcripts for all 10 seasons of the show as well as manual and crowdsourced annotation for subparts of the show are provided. All text data are available in the JSON files; please visit the individual task pages to retrieve datasets specifically designed for those tasks.

Statistics

Each season consists of episodes, each episode is divided into scenes, each scene comprises utterances, and each utterance is a list of sentences that are split into tokens.

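| Season ID | Episodes | Scenes | Utterances | Sentences | Tokens | Speakers |
|-----------|----------|--------|------------|-----------|--------|----------|
| s01 | 24 | 326 | 5,968 | 10,790 | 81,453 | 107 |
| s02 | 24 | 293 | 5,747 | 9,337 | 81,910 | 107 |
| s03 | 25 | 348 | 6,495 | 10,858 | 90,753 | 108 |
| s04 | 24 | 338 | 6,318 | 10,889 | 87,289 | 100 |
| s05 | 24 | 311 | 6,220 | 11,133 | 83,907 | 107 |
| s06 | 25 | 350 | 6,458 | 11,496 | 90,384 | 112 |
| s07 | 24 | 332 | 6,314 | 11,340 | 84,974 | 94 |
| s08 | 24 | 288 | 6,220 | 11,714 | 86,164 | 107 |
| s09 | 24 | 302 | 6,322 | 11,831 | 93,773 | 99 |
| s10 | 18 | 219 | 5,247 | 9,345 | 69,493 | 78 |
| Total | 236 | 3,107 | 61,309 | 108,733 | 850,100 | 700 |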
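
The season, episode, scene, utterance, sentence, and token hierarchy maps naturally onto nested JSON, so per-level counts like those in the table can be gathered with one nested traversal. The sketch below is illustrative only: the key names (`episodes`, `scenes`, `utterances`, `tokens`) and the miniature `sample_season` are assumptions, not confirmed field names, so check the actual JSON files before reusing it.

```python
from collections import Counter

# Hypothetical miniature season; the real files may use different keys.
sample_season = {
    "season_id": "s01",
    "episodes": [
        {
            "episode_id": "s01_e01",
            "scenes": [
                {
                    "scene_id": "s01_e01_c01",
                    "utterances": [
                        {
                            "utterance_id": "s01_e01_c01_u028",
                            "transcript": "Let me get you some coffee.",
                            # Assumed layout: one inner list per sentence,
                            # one string per token.
                            "tokens": [
                                ["Let", "me", "get", "you", "some", "coffee", "."]
                            ],
                        }
                    ],
                }
            ],
        }
    ],
}


def count_units(season):
    """Count episodes, scenes, utterances, sentences, and tokens in one season."""
    counts = Counter()
    for episode in season["episodes"]:
        counts["episodes"] += 1
        for scene in episode["scenes"]:
            counts["scenes"] += 1
            for utterance in scene["utterances"]:
                counts["utterances"] += 1
                for sentence in utterance["tokens"]:
                    counts["sentences"] += 1
                    counts["tokens"] += len(sentence)
    return counts


counts = count_units(sample_season)
print(dict(counts))
```

To apply this to a real season, load the file with `json.load` and pass the resulting dictionary to `count_units`, adjusting the key names as needed.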

Some utterances include action notes. In the following example, extracted from s01_e01_c01_u028, the speaker is talking to Ross, which is indicated by the action note:

"transcript": "Let me get you some coffee.",
"transcript_with_note": "(to Ross) Let me get you some coffee.",
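
A minimal sketch of how such an utterance might be handled in code: one helper splits an ID like `s01_e01_c01_u028` into its numeric parts, and another pulls parenthesized action notes out of the `transcript_with_note` field. The reading of the ID prefixes (`s` season, `e` episode, `c` scene, `u` utterance) and the assumption that every note is a parenthesized span are inferred from this single example and may not hold for the whole dataset; the function names are hypothetical.

```python
import re

# Assumed ID layout, inferred from "s01_e01_c01_u028":
# s = season, e = episode, c = scene, u = utterance.
ID_PATTERN = re.compile(r"^s(\d+)_e(\d+)_c(\d+)_u(\d+)$")

# Assumed note layout: a parenthesized span such as "(to Ross)".
NOTE_PATTERN = re.compile(r"\(([^()]*)\)")


def parse_utterance_id(utterance_id):
    """Split an utterance ID into integer season/episode/scene/utterance parts."""
    match = ID_PATTERN.match(utterance_id)
    if match is None:
        raise ValueError(f"Unexpected utterance ID format: {utterance_id!r}")
    season, episode, scene, utterance = (int(part) for part in match.groups())
    return {
        "season": season,
        "episode": episode,
        "scene": scene,
        "utterance": utterance,
    }


def extract_notes(transcript_with_note):
    """Return the text of every parenthesized note found in the transcript."""
    return [note.strip() for note in NOTE_PATTERN.findall(transcript_with_note)]


parsed = parse_utterance_id("s01_e01_c01_u028")
notes = extract_notes("(to Ross) Let me get you some coffee.")
print(parsed, notes)
```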

The following table shows the statistics when action notes are included:

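| Season ID | Utterances | Sentences | Tokens |
|-----------|------------|-----------|--------|
| s01 | 6,626 | 12,088 | 100,773 |
| s02 | 6,048 | 10,565 | 97,763 |
| s03 | 7,267 | 12,288 | 117,912 |
| s04 | 7,119 | 12,811 | 116,703 |
| s05 | 7,082 | 13,540 | 118,509 |
| s06 | 7,235 | 13,506 | 120,471 |
| s07 | 7,019 | 13,363 | 116,341 |
| s08 | 6,845 | 13,321 | 109,984 |
| s09 | 6,653 | 13,548 | 119,090 |
| s10 | 5,479 | 11,029 | 93,390 |
| Total | 67,373 | 126,059 | 1,110,936 |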
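
To see how much the action notes contribute, the totals of the two tables can be compared directly. The short sketch below uses only the "Total" rows quoted above; the function name `note_overhead` is a hypothetical helper introduced for illustration.

```python
# Totals copied from the two statistics tables above.
WITHOUT_NOTES = {"utterances": 61_309, "sentences": 108_733, "tokens": 850_100}
WITH_NOTES = {"utterances": 67_373, "sentences": 126_059, "tokens": 1_110_936}


def note_overhead(without_notes, with_notes):
    """Return, per unit, how many items are added when action notes are included."""
    return {unit: with_notes[unit] - without_notes[unit] for unit in without_notes}


overhead = note_overhead(WITHOUT_NOTES, WITH_NOTES)
for unit, added in overhead.items():
    # Share of the note-inclusive total that comes from the added items.
    share = added / WITH_NOTES[unit]
    print(f"{unit}: +{added:,} ({share:.1%} of the total with notes)")
```

Including action notes adds 6,064 utterances, 17,326 sentences, and 260,836 tokens to the totals.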

Documentation

References

Contact
