Character Mining
The Character Mining project challenges machine comprehension on multiparty dialogue. The objective of this project is to infer explicit and implicit contexts about individual characters through their conversations. This is an open-source project led by the Emory NLP research group that provides resources for the following tasks:
- Character Identification (since May 2016).
- Emotion Detection (since May 2017).
- Reading Comprehension (since May 2018).
- Question Answering (since May 2019).
- Personality Detection (since Sep 2019).
We welcome feedback and contributions from the community. Most of our annotations are crowdsourced, so errors are to be expected. Please make a pull request if you wish to fix errors in our datasets.
Dataset
Our dataset is based on the popular TV show called Friends. Transcripts for all 10 seasons of the show as well as manual and crowdsourced annotation for subparts of the show are provided. All text data are available in the JSON files; please visit the individual task pages to retrieve datasets specifically designed for those tasks.
Statistics
Each season consists of episodes, each episode is divided into scenes, each scene comprises utterances, and each utterance is a list of tokenized sentences.
Season ID | Episodes | Scenes | Utterances | Sentences | Tokens | Speakers |
---|---|---|---|---|---|---|
s01 | 24 | 326 | 5,968 | 10,790 | 81,453 | 107 |
s02 | 24 | 293 | 5,747 | 9,337 | 81,910 | 107 |
s03 | 25 | 348 | 6,495 | 10,858 | 90,753 | 108 |
s04 | 24 | 338 | 6,318 | 10,889 | 87,289 | 100 |
s05 | 24 | 311 | 6,220 | 11,133 | 83,907 | 107 |
s06 | 25 | 350 | 6,458 | 11,496 | 90,384 | 112 |
s07 | 24 | 332 | 6,314 | 11,340 | 84,974 | 94 |
s08 | 24 | 288 | 6,220 | 11,714 | 86,164 | 107 |
s09 | 24 | 302 | 6,322 | 11,831 | 93,773 | 99 |
s10 | 18 | 219 | 5,247 | 9,345 | 69,493 | 78 |
Total | 236 | 3,107 | 61,309 | 108,733 | 850,100 | 700 |
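The hierarchy described above maps naturally onto nested loops. The following is a minimal sketch, not the project's own code: it assumes each season is a JSON object whose `episodes`, `scenes`, and `utterances` keys hold lists, and that each utterance carries a `speakers` list and a `tokens` list of tokenized sentences. These key names are assumptions and should be checked against the actual files.

```python
from collections import Counter


def count_statistics(season):
    """Count the units in one season dict (key names are assumed, not confirmed)."""
    counts = Counter()
    speakers = set()
    for episode in season["episodes"]:
        counts["episodes"] += 1
        for scene in episode["scenes"]:
            counts["scenes"] += 1
            for utterance in scene["utterances"]:
                counts["utterances"] += 1
                speakers.update(utterance["speakers"])
                # "tokens" is assumed to be a list of sentences,
                # each sentence being a list of token strings.
                for sentence in utterance["tokens"]:
                    counts["sentences"] += 1
                    counts["tokens"] += len(sentence)
    counts["speakers"] = len(speakers)
    return dict(counts)


# Toy season built by hand for illustration; it is not taken from the dataset files.
toy_season = {
    "episodes": [
        {
            "scenes": [
                {
                    "utterances": [
                        {
                            "speakers": ["Speaker A"],
                            "tokens": [["Let", "me", "get", "you", "some", "coffee", "."]],
                        },
                        {
                            "speakers": ["Speaker B"],
                            "tokens": [["Thanks", "."]],
                        },
                    ]
                }
            ]
        }
    ]
}

print(count_statistics(toy_season))
```

Speakers are collected in a set so that a character who speaks in several scenes is counted once per season.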
Some utterances include action notes.
In the following example, extracted from s01_e01_c01_u028, the speaker is talking to Ross, which is indicated by the action note:
"transcript": "Let me get you some coffee.",
"transcript_with_note": "(to Ross) Let me get you some coffee.",
The following table shows the statistics including action notes:
Season ID | Utterances | Sentences | Tokens |
---|---|---|---|
s01 | 6,626 | 12,088 | 100,773 |
s02 | 6,048 | 10,565 | 97,763 |
s03 | 7,267 | 12,288 | 117,912 |
s04 | 7,119 | 12,811 | 116,703 |
s05 | 7,082 | 13,540 | 118,509 |
s06 | 7,235 | 13,506 | 120,471 |
s07 | 7,019 | 13,363 | 116,341 |
s08 | 6,845 | 13,321 | 109,984 |
s09 | 6,653 | 13,548 | 119,090 |
s10 | 5,479 | 11,029 | 93,390 |
Total | 67,373 | 126,059 | 1,110,936 |
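Action notes, like the `(to Ross)` prefix in the s01_e01_c01_u028 example, can be separated from the spoken text with a small helper. The sketch below is only illustrative. It assumes that the ID encodes season, episode, scene, and utterance numbers (inferred from the hierarchy described above), that a note is a parenthesized span at the start of `transcript_with_note`, and that the ID is stored under a hypothetical `utterance_id` key.

```python
import re

# Assumed ID layout, inferred from the example ID:
# s = season, e = episode, c = scene, u = utterance.
ID_PATTERN = re.compile(r"s(\d+)_e(\d+)_c(\d+)_u(\d+)")

# Assumes an action note is a parenthesized span at the very start of the text.
NOTE_PATTERN = re.compile(r"^\((?P<note>[^)]*)\)\s*")


def parse_utterance_id(utterance_id):
    """Split an ID such as 's01_e01_c01_u028' into integer parts."""
    match = ID_PATTERN.fullmatch(utterance_id)
    if match is None:
        raise ValueError(f"unexpected utterance ID: {utterance_id!r}")
    season, episode, scene, utterance = map(int, match.groups())
    return {"season": season, "episode": episode, "scene": scene, "utterance": utterance}


def leading_action_note(utterance):
    """Return the leading action note without parentheses, or None if absent."""
    match = NOTE_PATTERN.match(utterance["transcript_with_note"])
    return match.group("note") if match else None


# The two transcript fields come from the example above;
# the "utterance_id" key is a hypothetical name used for illustration.
utterance = {
    "utterance_id": "s01_e01_c01_u028",
    "transcript": "Let me get you some coffee.",
    "transcript_with_note": "(to Ross) Let me get you some coffee.",
}

print(parse_utterance_id(utterance["utterance_id"]))
# {'season': 1, 'episode': 1, 'scene': 1, 'utterance': 28}
print(leading_action_note(utterance))
# to Ross
```

Notes that appear in the middle or at the end of an utterance would need a broader pattern; this helper only handles the leading case shown in the example.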
Documentation
- How to retrieve information from the JSON files: load_json.ipynb.
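Independently of the notebook, a season file can be read with Python's standard `json` module. The sketch below is a rough illustration under stated assumptions: the file name `toy_season.json` and the `episodes`, `scenes`, and `utterances` keys are hypothetical, and the demo writes its own toy file to a temporary directory so that it runs without the dataset.

```python
import json
import os
import tempfile


def load_season(path):
    """Read one season JSON file into a Python dict."""
    with open(path, encoding="utf-8") as fin:
        return json.load(fin)


def iter_utterances(season):
    """Yield every utterance dict in reading order (key names are assumed)."""
    for episode in season["episodes"]:
        for scene in episode["scenes"]:
            yield from scene["utterances"]


# Toy data standing in for a real season file; only the two transcript
# fields are taken from the example in this README.
toy = {
    "episodes": [
        {
            "scenes": [
                {
                    "utterances": [
                        {
                            "transcript": "Let me get you some coffee.",
                            "transcript_with_note": "(to Ross) Let me get you some coffee.",
                        }
                    ]
                }
            ]
        }
    ]
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_season.json")  # hypothetical file name
    with open(path, "w", encoding="utf-8") as fout:
        json.dump(toy, fout)
    season = load_season(path)

transcripts = [u["transcript"] for u in iter_utterances(season)]
print(transcripts)
```

To use it on the real data, point `load_season` at one of the provided JSON files and adjust the key names if they differ.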
References
- Transformers to Learn Hierarchical Contexts in Multiparty Dialogue for Span-based Question Answering. Changmao Li and Jinho D. Choi. In Proceedings of the Conference of the Association for Computational Linguistics, ACL'20, 2020.
- Modeling Personality with Attentive Networks and Contextual Embeddings. Hang Jiang, Xianzhe Zhang, and Jinho D. Choi. In Proceedings of the AAAI Student Abstract and Poster Program, AAAI:SAP'20, 2020 (poster).
- FriendsQA: Open-Domain Question Answering on TV Show Transcripts. Zhengzhe Yang and Jinho D. Choi. In Proceedings of the Annual Conference of the ACL Special Interest Group on Discourse and Dialogue, SIGDIAL'19, 2019 (slides).
- They Exist! Introducing Plural Mentions to Coreference Resolution and Entity Linking. Ethan Zhou and Jinho D. Choi. In Proceedings of the 27th International Conference on Computational Linguistics, COLING'18, 2018 (slides).
- SemEval 2018 Task 4: Character Identification on Multiparty Dialogues. Jinho D. Choi and Henry Y. Chen. In Proceedings of the International Workshop on Semantic Evaluation, SemEval'18, 2018 (slides).
- Challenging Reading Comprehension on Daily Conversation: Passage Completion on Multiparty Dialog. Kaixin Ma, Tomasz Jurczyk, and Jinho D. Choi. In Proceedings of the 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics, NAACL'18, 2018 (poster, source).
- Emotion Detection on TV Show Transcripts with Sequence-based Convolutional Neural Networks. Sayyed Zahiri and Jinho D. Choi. In The AAAI Workshop on Affective Content Analysis, AFFCON'18, 2018.
- Cross-domain Document Retrieval: Matching between Conversational and Formal Writings. Tomasz Jurczyk and Jinho D. Choi. In Proceedings of the EMNLP Workshop on Building Linguistically Generalizable NLP Systems, BLGNLP'17, 2017 (slides).
- Robust Coreference Resolution and Entity Linking on Dialogues: Character Identification on TV Show Transcripts. Henry Y. Chen, Ethan Zhou, and Jinho D. Choi. In Proceedings of the 21st Conference on Computational Natural Language Learning, CoNLL'17, 2017 (slides).
- Text-based Speaker Identification on Multiparty Dialogues Using Multi-document Convolutional Neural Networks. Kaixin Ma, Catherine Xiao, and Jinho D. Choi. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, ACL:SRW'17, 2017 (poster).
- Character Identification on Multiparty Conversation: Identifying Mentions of Characters in TV Shows. Henry Y. Chen and Jinho D. Choi. In Proceedings of the 17th Annual SIGdial Meeting on Discourse and Dialogue, SIGDIAL'16, 2016 (poster).