Character Mining

The Character Mining project challenges machine comprehension on multiparty dialogue. The objective of this project is to infer explicit and implicit contexts about individual characters through their conversations. This is an open-source project led by the Emory NLP research group that provides resources for the following tasks:

Character Identification (since May 2016).
Emotion Detection (since May 2017).
Reading Comprehension (since May 2018).
Questiong Answering (since May 2019).
Personality Detection (since Sep 2019).

We welcome feedbacks and contributions from the community. Most of our annotation are crowdsourced; implying that, errors are expected to be found. Please make pull requests if you wish to fix errors in our datasets.

Dataset

Our dataset is based on the popular TV show called Friends. Transcripts for all 10 seasons of the show as well as manual and crowdsourced annotation for subparts of the show are provided. All text data are available in the JSON files; please visit the individual task pages to retrieve datasets specifically designed for those tasks.

Statistics

Each season consists of episodes, each episode is divided into scenes, each scene comprises utterances, each utterance is a list of sentences where tokens are split.

Season ID	Episodes	Scenes	Utterances	Sentences	Tokens	Speakers
s01	24	326	5,968	10,790	81,453	107
s02	24	293	5,747	9,337	81,910	107
s03	25	348	6,495	10,858	90,753	108
s04	24	338	6,318	10,889	87,289	100
s05	24	311	6,220	11,133	83,907	107
s06	25	350	6,458	11,496	90,384	112
s07	24	332	6,314	11,340	84,974	94
s08	24	288	6,220	11,714	86,164	107
s09	24	302	6,322	11,831	93,773	99
s10	18	219	5,247	9,345	69,493	78
Total	236	3,107	61,309	108,733	850,100	700

Some utterances include action notes. In the following example, extracted from s01_e01_c01_u028, the speaker is talking to Ross, which is indicated by the action note:

"transcript": "Let me get you some coffee.",
"transcript_with_note": "(to Ross) Let me get you some coffee.",

The followings show the statistics including action notes:

Season ID	Utterances	Sentences	Tokens
s01	6,626	12,088	100,773
s02	6,048	10,565	97,763
s03	7,267	12,288	117,912
s04	7,119	12,811	116,703
s05	7,082	13,540	118,509
s06	7,235	13,506	120,471
s07	7,019	13,363	116,341
s08	6,845	13,321	109,984
s09	6,653	13,548	119,090
s10	5,479	11,029	93,390
Total	67,373	126,059	1,110,936

Documentations

How to retrieve information from the JSON files: load_json.ipynb.

References

Transformers to Learn Hierarchical Contexts in Multiparty Dialogue for Span-based Question Answering. Changmao Li and Jinho D. Choi. In Proceedings of the Conference of the Association for Computational Linguistics, ACL'20, 2020.
Modeling Personality with Attentive Networks and Contextual Embeddings. Hang Jiang, Xianzhe Zhang, and Jinho D. Choi. In Proceedings of the AAAI Student Abstract and Poster Program, AAAI:SAP'20, 2020 (poster).
FriendsQA: Open-Domain Question Answering on TV Show Transcripts. Zhengzhe Yang and Jinho D. Choi. In Proceedings of the Annual Conference of the ACL Special Interest Group on Discourse and Dialogue, SIGDIAL'19, 2019 (slides).
They Exist! Introducing Plural Mentions to Coreference Resolution and Entity Linking. Ethan Zhou and Jinho D. Choi. In Proceedings of the 27th International Conference on Computational Linguistics, COLING'18, 2018 (slides).
SemEval 2018 Task 4: Character Identification on Multiparty Dialogues, Jinho D. Choi and Henry Y. Chen, Proceedings of the International Workshop on Semantic Evaluation, SemEval'18, 2018 (slides).
Challenging Reading Comprehension on Daily Conversation: Passage Completion on Multiparty Dialog. Kaixin Ma, Tomasz Jurczyk, and Jinho D. Choi. In Proceedings of the 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics, NAACL'18, 2018 (poster, source).
Emotion Detection on TV Show Transcripts with Sequence-based Convolutional Neural Networks. Sayyed Zahiri and Jinho D. Choi. In The AAAI Workshop on Affective Content Analysis, AFFCON'18, 2018.
Cross-domain Document Retrieval: Matching between Conversational and Formal Writings. Tomasz Jurczyk and Jinho D. Choi. In Proceedings of the EMNLP Workshop on Building Linguistically Generalizable NLP Systems, of BLGNLP'17, 2017 (slides).
Robust Coreference Resolution and Entity Linking on Dialogues: Character Identification on TV Show Transcripts, Henry Y. Chen, Ethan Zhou, and Jinho D. Choi. Proceedings of the 21st Conference on Computational Natural Language Learning, CoNLL'17, 2017 (slides).
Text-based Speaker Identification on Multiparty Dialogues Using Multi-document Convolutional Neural Networks. Kaixin Ma, Catherine Xiao, and Jinho D. Choi. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, ACL:SRW'17, 2017 (poster).
Character Identification on Multiparty Conversation: Identifying Mentions of Characters in TV Shows, Henry Y. Chen and Jinho D. Choi. Proceedings of the 17th Annual SIGdial Meeting on Discourse and Dialogue, SIGDIAL'16, 2016 (poster).

Contact

Jinho D. Choi.

emorynlp/character-mining

emorynlp

Reviews

Repository Details