Japanese Word Similarity Dataset

Data

We built a Japanese word similarity dataset that includes rare words.

We target verbs, adjectives, nouns, and adverbs.

Data construction

We constructed our dataset following the Stanford Rare Word Similarity Dataset (RW) proposed by Luong et al. (2013).

We extracted pairs of Japanese verbs (including sahen verbs) and adjectives (both i-adjectives and na-adjectives) from the evaluation dataset for Japanese lexical simplification by Kodaira et al. (2016).

We used a crowdsourcing service (Lancers) to recruit 10 annotators, who rated the similarity of each word pair on an 11-level scale.

0 (most dissimilar) - 10 (most similar)

Entry

A sample of the dataset is shown below:

word1 word2 mean(remove_extreme_annotator) sub1 sub2 ... sub9 sub10 mean
ζŽ’ι™€γ™γ‚‹ 焑視する 4.6 5 3 ... 5 6 4.8
ζŽ’ι™€γ™γ‚‹ 陀倖する 6.6 7 6 ... 5 7 6.8

mean(remove_extreme_annotator): average of the similarity scores, excluding annotators who assigned extreme values

mean: average of the similarity scores assigned by all annotators

sub*: the similarity score assigned by each annotator
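As a rough sketch of how a row in this layout could be read, the snippet below assumes the columns are whitespace-separated in the order given in the header line; the actual files may use a different delimiter. The `Entry` class, the `parse_row` helper, and the example row are hypothetical and are not taken from the dataset or its `src` directory.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List


@dataclass
class Entry:
    word1: str
    word2: str
    mean_without_extremes: float  # the mean(remove_extreme_annotator) column
    scores: List[int]             # sub1 .. sub10, one score per annotator
    mean_all: float               # the final mean column


def parse_row(line: str, n_annotators: int = 10) -> Entry:
    """Parse one row; whitespace as the delimiter is an assumption."""
    fields = line.split()
    if len(fields) != n_annotators + 4:
        raise ValueError(f"expected {n_annotators + 4} fields, got {len(fields)}")
    scores = [int(s) for s in fields[3:3 + n_annotators]]
    if any(s < 0 or s > 10 for s in scores):
        raise ValueError("scores must lie on the 0-10 scale")
    return Entry(fields[0], fields[1], float(fields[2]), scores, float(fields[-1]))


# Made-up row for illustration only; not an entry from the dataset.
row = "word_a word_b 5.0 4 5 6 5 5 4 6 5 5 5 5.0"
entry = parse_row(row)
print(entry.word1, entry.word2, mean(entry.scores))
```

Recomputing the plain mean from the `sub*` columns, as in the last line, is a cheap consistency check against the `mean` column. The rule for identifying extreme annotators is not described here, so the sketch does not attempt to recompute that column.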

Helper script in src

The src directory contains a helper script that calculates Spearman's rank correlation coefficient, as used in our LREC paper.

Specifically, we trained word vectors on Japanese Wikipedia and calculated the rank correlation coefficient between the similarity of each word pair under those vectors and the mean of the annotated scores.
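The evaluation described above can be sketched in plain Python as follows. This is not the script in src: it assumes cosine similarity between word vectors, and the vectors, words, and gold scores are toy values invented for illustration. All function names are hypothetical. In practice, `scipy.stats.spearmanr` computes the same coefficient.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))
    return dot / norm


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks in ascending order; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: the Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy vectors and gold scores, invented for illustration only.
vectors = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [0.5, 0.5]}
pairs = [("a", "b", 8.0), ("a", "d", 5.0), ("a", "c", 1.0)]

predicted = [cosine(vectors[w1], vectors[w2]) for w1, w2, _ in pairs]
gold = [score for _, _, score in pairs]
rho = spearman(predicted, gold)
print(rho)
```

Because Spearman's coefficient compares only rankings, the scale of the annotated scores (0 to 10) and of the cosine similarities does not need to match.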

License

Our work is licensed under Creative Commons Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0).

Citation

If you use our dataset, please cite our LREC paper.

  1. Yuya Sakaizawa and Mamoru Komachi. Construction of a Japanese Word Similarity Dataset. In 11th edition of the Language Resources and Evaluation Conference (LREC 2018), pp.948-951. May 2018.
  2. Yuya Sakaizawa and Mamoru Komachi. Construction of a Japanese Word Similarity Dataset. In arXiv e-prints, 1703.05916 (5 pages). March 2017.

References

  1. Tomonori Kodaira, Tomoyuki Kajiwara, Mamoru Komachi. Controlled and Balanced Dataset for Japanese Lexical Simplification. ACL 2016 Student Research Workshop, pp.1-7. August 2016.
  2. Minh-thang Luong, Richard Socher, Christopher D. Manning. Better Word Representations with Recursive Neural Networks for Morphology. CoNLL 2013, pp.104-113. August 2013.

Tokyo Metropolitan University

Yuya Sakaizawa

e-mail: ksf.doingmorewithless-at-gmail.com

