
HappyDB :)

HappyDB is a corpus of 100,000+ crowd-sourced happy moments. The goal of the corpus is to advance the state of the art in understanding the causes of happiness that can be gleaned from text. Please also see our website.

Data Collection

We conducted a large-scale collection of happy moments over 3 months on Amazon Mechanical Turk (MTurk). For every task, we asked the MTurk workers to describe 3 happy moments from the past 24 hours (or the past 3 months).

Here are the instructions for the data collection task.

What made you happy today?  Reflect on the past {24 hours|3 months}, and recall three actual events 
that happened to you that made you happy.  Write down your happy moment in a complete sentence.
Write three such moments.

Examples of happy moments we are NOT looking for (e.g., events in the distant past, partial sentences):
    - The day I married my spouse
    - My Dog

HappyDB Statistics

Basic statistics of HappyDB are shown in the table below.

Collection period              3/28/2017 - 6/16/2017
# happy moments                100,922
# distinct users               10,843
# distinct words               38,188
Avg. # happy moments / user    9.31
Avg. # words / happy moment    19.66
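
The counts above can be reproduced approximately from cleaned_hm.csv once the data is downloaded (see below). Here is a minimal pandas sketch; the column names follow the schema described later in this README, and the whitespace tokenization is an assumption made here, so the word-based figures may differ slightly from the table.

import pandas as pd

# Load the cleaned happy moments (path follows the directory layout below).
df = pd.read_csv("happydb/data/cleaned_hm.csv")

print("# happy moments:", len(df))
print("# distinct users:", df["wid"].nunique())
print("Avg. # happy moments / user:", round(len(df) / df["wid"].nunique(), 2))

# Approximate word counts by whitespace tokenization (an assumption; the
# official numbers may be based on a different tokenizer).
print("Avg. # words / happy moment:",
      round(df["cleaned_hm"].str.split().str.len().mean(), 2))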

How to Download

You can download the dataset by using the git command below or by simply downloading the files from the repository.

$ git clone <git-repository-path>

Directory Structure

After you clone or download the repository, you will see the following file structure.

happydb
└── data
    ├── cleaned_hm.csv
    ├── demographic.csv
    ├── original_hm.csv
    ├── senselabel.csv
    ├── topic_dict
    │   ├── entertainment-dict.csv
    │   ├── exercise-dict.csv
    │   ├── family-dict.csv
    │   ├── food-dict.csv
    │   ├── people-dict.csv
    │   ├── pets-dict.csv
    │   ├── school-dict.csv
    │   ├── shopping-dict.csv
    │   └── work-dict.csv
    └── vad.csv

HappyDB consists of a set of CSV files. Here are schema descriptions of the files.

cleaned_hm.csv

cleaned_hm.csv contains the cleaned-up happy moments, the original happy moments, and some additional information; a short loading sketch follows the column list below.

  • hmid (int): Happy moment ID
  • wid (int): Worker ID
  • reflection_period (str): Reflection period used in the instructions provided to the worker (3m or 24h)
  • original_hm (str): Original happy moment
  • cleaned_hm (str): Cleaned happy moment
  • modified (bool): If True, original_hm is "cleaned up" to generate cleaned_hm (True or False)
  • predicted_category (str): Happiness category label predicted by our classifier (7 categories. Please see the reference for details)
  • ground_truth_category (str): Ground truth category label. The value is NaN if the ground truth label is missing for the happy moment
  • num_sentence (int): Number of sentences in the happy moment
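
As a quick way to explore these columns, here is a small pandas sketch (an illustration, not part of the official tooling) that inspects rows where the text was modified and keeps only the moments that have a ground-truth label.

import pandas as pd

df = pd.read_csv("happydb/data/cleaned_hm.csv")

# Rows where the original text was edited to produce cleaned_hm.
# Depending on how the CSV encodes booleans, you may need to compare
# against the string "True" instead.
modified = df[df["modified"] == True]
print(modified[["original_hm", "cleaned_hm"]].head())

# Keep only moments with a human-provided category label
# (ground_truth_category is NaN when the label is missing).
labeled = df.dropna(subset=["ground_truth_category"])
print(len(labeled), "moments have a ground-truth category")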

original_hm.csv

original_hm.csv contains the unfiltered versions of the happy moments; a join example follows the column list below.

  • hmid (int): Happy moment ID
  • wid (int): Worker ID
  • hm (str): Original happy moment
  • reflection_period (str): Reflection period used in the instructions provided to the worker (3m or 24h)
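
To relate the unfiltered text to its cleaned counterpart, a simple join on hmid works. The sketch below is a minimal illustration with pandas.

import pandas as pd

orig = pd.read_csv("happydb/data/original_hm.csv")
clean = pd.read_csv("happydb/data/cleaned_hm.csv")

# Align the raw and cleaned text of each happy moment via its hmid.
merged = orig[["hmid", "hm"]].merge(
    clean[["hmid", "cleaned_hm"]], on="hmid", how="inner"
)
print(merged.head())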

demographic.csv

demographic.csv contains demographic information about the workers who contributed to the happy moment collection; a merge example follows the column list below.

  • wid (int): Worker ID
  • age (float): Age
  • country (str): Country of residence (follows the ISO 3166 Country Code)
  • gender (str): {Male (m), Female (f), Other (o)}
  • marital (str): Marital status {single, married, divorced, separated, or widowed}
  • parenthood (str): Parenthood status {yes (y) or no (n)}
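
Demographics can be attached to the happy moments through the worker ID. A minimal pandas sketch, for illustration only, using the column names listed above:

import pandas as pd

hm = pd.read_csv("happydb/data/cleaned_hm.csv")
demo = pd.read_csv("happydb/data/demographic.csv")

# Attach worker demographics to each happy moment via the worker ID.
joined = hm.merge(demo, on="wid", how="left")

# Example slice: predicted categories broken down by parenthood status.
print(joined.groupby("parenthood")["predicted_category"].value_counts())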

senselabel.csv

senselabel.csv contains multi-word expression (MWE) and supersense tags for the "cleaned" happy moments. Thus, it covers exactly the same set of happy moments as cleaned_hm.csv.

  • hmid (int): Happy moment ID
  • tokenOffset (int): Position index of a token
  • word (str): Token in the original form
  • lowercaseLemma (str): Lemmatized token in lowercase
  • POS (str): Part-of-Speech tag
  • MWE (str): Multi-word expression (MWE) tag in the extended IOB style (See [REF] for further information)
  • offsetParent (int): The beginning position of a multi-word expression
  • supersenseLabel (str): Supersense classes defined in WordNet (19 verb and 25 noun classes; see [REF] for further information)

Here is an example of an annotated happy moment. The sentence is tokenized into words. Note that "got" in the 2nd position is lemmatized as "get" in the lowercaseLemma column. The supersenseLabel column shows the supersense label of each word or multi-word expression.

    hmid  tokenOffset      word lowercaseLemma    POS MWE  offsetParent supersenseLabel
0  70027            1         I              i   PRON   O             0             NaN
1  70027            2       got            get   VERB   B             0        v.motion
2  70027            3        my             my   PRON   o             0             NaN
3  70027            4       car            car   NOUN   o             0      n.artifact
4  70027            5     waxed          waxed   NOUN   I             2             NaN
5  70027            6       and            and   CONJ   O             0             NaN
6  70027            7  polished         polish   VERB   O             0        v.motion
7  70027            8         .              .  PUNCT   O             0             NaN
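
Because senselabel.csv stores one row per token, related rows can be grouped by hmid to work with whole moments. A minimal pandas sketch, reusing the hmid from the example above:

import pandas as pd

tokens = pd.read_csv("happydb/data/senselabel.csv")

# Reassemble the annotated moment shown above from its token rows.
one = tokens[tokens["hmid"] == 70027].sort_values("tokenOffset")
print(" ".join(one["word"].astype(str)))

# Count the most frequent supersense-labeled noun lemmas in the corpus.
nouns = tokens[tokens["supersenseLabel"].str.startswith("n.", na=False)]
print(nouns["lowercaseLemma"].value_counts().head())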

topic_dict/*-dict.csv

happydb/data/topic_dict contains topic dictionaries (e.g., entertainment, exercise) that were manually prepared for analyzing the topics of happy moments. Each file contains keywords that belong to its topic.
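
Here is a small sketch of how such a dictionary might be used to flag moments about a topic. It assumes the dictionary files hold one keyword per line with no header row; adjust the loading code if the actual file layout differs.

import pandas as pd

# Assumption: one keyword per line, no header; adjust if the format differs.
food_words = set(
    pd.read_csv("happydb/data/topic_dict/food-dict.csv", header=None)[0]
    .astype(str).str.lower()
)

hm = pd.read_csv("happydb/data/cleaned_hm.csv")

# Flag moments that mention any food-related keyword (simple word match).
def mentions_food(text):
    return any(w in food_words for w in str(text).lower().split())

hm["about_food"] = hm["cleaned_hm"].apply(mentions_food)
print(f"{hm['about_food'].mean():.3f} of moments mention a food keyword")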

Getting started with the dataset

All files (except for the *-dict.csv files) follow the CSV format. Most programming languages have a library with CSV-loading functionality. In Python, pandas is a common library for handling such data.

Here are some examples of looking at the data with the pandas library.

>>> import pandas as pd
>>> df = pd.read_csv("happydb/data/cleaned_hm.csv")
>>> df["cleaned_hm"].head()

0    I went on a successful date with someone I fel...
1    I was happy when my son got 90% marks in his e...
2         I went to the gym this morning and did yoga.
3    We had a serious talk with some friends of our...
4    I went with grandchildren to butterfly display...
Name: cleaned_hm, dtype: object

>>> df["predicted_cattegor"โ€].value_counts
affection           34206
achievement         34044
enjoy_the_moment    11202
bonding             10729
leisure              7505
nature               1846
exercise             1206
Name: predicted_category, dtype: int64

We plan to release sample scripts and Jupyter (IPython) notebooks for HappyDB soon!

References

Please cite the following publication if you use the dataset in your work.

Akari Asai, Sara Evensen, Behzad Golshan, Alon Halevy, Vivian Li, Andrei Lopatenko,
Daniela Stepanov, Yoshihiko Suhara, Wang-Chiew Tan, Yinzhan Xu,
"HappyDB: A Corpus of 100,000 Crowdsourced Happy Moments", LREC '18, May 2018. (to appear)

Contact

Please ask us questions at [email protected].
