Word Vector Representation for Korean
Subword-level Word Vector Representations for Korean
Sungjoon Park, Jeongmin Byun, Sion Baek, Yongseok Cho, Alice Oh
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL 2018)
Abstract
Research on distributed word representations is focused on widely-used languages such as English. Although the same methods can be used for other languages, language-specific knowledge can enhance the accuracy and richness of word vector representations. In this paper, we look at improving distributed word representations for Korean using knowledge about the unique linguistic structure of Korean. Specifically, we decompose Korean words into the jamo-level, beyond the character-level, allowing a systematic use of subword information. To evaluate the vectors, we develop Korean test sets for word similarity and analogy and make them publicly available. The results show that our simple method outperforms word2vec and character-level Skip-Grams on semantic and syntactic similarity and analogy tasks and contributes positively toward downstream NLP tasks such as sentiment analysis.
Dataset
We release our evaluation datasets for Korean word vectors. Details are described below. We plan to develop more evaluation sets for the Korean NLP community, so comments on these sets and collaboration on constructing other sets are welcome!
1. WS-353 for word similarity (Korean)
- 2 graduate students translated the original (English) set.
- 14 native Korean speakers participated in evaluating the set.
- For each pair, we excluded the minimum and maximum scores and computed the mean of the remaining scores.
- The scores show a .82 correlation with the original English set.
- Some words were replaced with words more familiar to Koreans (e.g., Arafat -> 안중근).
2. Word Analogies (Korean)
- 10,000 items: 5,000 semantic and 5,000 syntactic.
- 5 semantic and 5 syntactic categories.
- Each category contains 1,000 items.
- Syntactic Features (with an example):
- Case : 자동차 자동차를 인터넷 인터넷을
- Tense : 가다 갔다 공부하다 공부했다
- Voice : 갈다 갈리다 거래하다 거래되다
- Verb form : 가다 가고 놀다 놀고
- Honorific : 가다 가시다 공부하다 공부하시다
- Semantic Features (with an example):
- Capital-countries : 아테네 그리스 바그다드 이라크
- male-female : 남자 여자 아버지 어머니
- name-nationality : 간디 인도 나폴레옹 프랑스
- country-language : 아르헨티나 스페인어 미국 영어
- misc : 개 강아지 소 송아지
Korean word vector representation learning
Building fastText for Korean
Before you start training Korean word vectors, you should build the source of the subword-level Korean word vectors (a.k.a. Korean fastText) by using make:
$ cd src
$ make
This will produce object files for all the classes as well as the main binary fasttext.
1. Parse Korean documents.
First, you should parse a Korean document with decompose_letters.py.
This script decomposes the Hangul syllables in the document into jamo, generating a parsed document. The parsed document will be used as the training data for the vectors. An example use case is as follows:
python decompose_letters.py [input_file_name] [parsed_file_name]
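For reference, the syllable-to-jamo decomposition can be sketched with the standard Unicode Hangul syllable arithmetic. This is an illustrative reimplementation, not the repository's decompose_letters.py itself; it assumes the empty-jongsung symbol "e", matching the -emptyjschar default used during training:

```python
# Illustrative Hangul-to-jamo decomposition (a sketch, not the repo's
# decompose_letters.py). Precomposed syllables occupy U+AC00..U+D7A3 and
# encode (choseong, jungseong, jongseong) as an index in base 21*28.
CHO = list("ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ")
JUNG = list("ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ")
JONG = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")
EMPTY_JONG = "e"  # empty-jongsung symbol; matches the -emptyjschar default

def jamo_split(text):
    out = []
    for ch in text:
        code = ord(ch)
        if 0xAC00 <= code <= 0xD7A3:  # precomposed Hangul syllable
            idx = code - 0xAC00
            cho, jung, jong = idx // 588, (idx % 588) // 28, idx % 28
            out.append(CHO[cho] + JUNG[jung] + (JONG[jong] or EMPTY_JONG))
        else:
            out.append(ch)  # pass non-Hangul characters through unchanged
    return "".join(out)

print(jamo_split("강아지"))  # -> ㄱㅏㅇㅇㅏeㅈㅣe
```

Padding with "e" keeps every syllable at exactly three symbols, which is what lets fixed-length jamo-level n-grams align consistently with syllable structure.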
2. Train Korean word vectors.
Then, you can train subword-level word vectors for Korean. The source code builds on the implementation of fastText, so you can run the compiled binary just like the original fastText. Note that it takes as input the file [parsed_file_name] generated by decompose_letters.py. An example use case is as follows:
[fastText_executable_path] skipgram -input [parsed_file_name] -output [output_file_name] -minCount 10 -minjn 3 -maxjn 5 -minn 1 -maxn 4 -dim 300 -ws 5 -epoch 5 -neg 5 -loss ns -thread 16
The full list of parameters is given below.
-minCount : minimal number of word occurrences [5]
-bucket : number of buckets [10000000]
-minn : min length of char ngram [1]
-maxn : max length of char ngram [4]
-minjn : min length of jamo ngram [3]
-maxjn : max length of jamo ngram [5]
-emptyjschar : empty jongsung symbol ["e"]
-t : sampling threshold [1e-4]
-lr : learning rate [0.05]
-dim : size of word vectors [100]
-ws : size of the context window [5]
-loss : loss function {ns, hs, softmax} [softmax]
-neg : number of negatives sampled [5]
-epoch : number of epochs [5]
-thread : number of threads [12]
-verbose : verbosity level [2]
As described in the paper, the default character-level n-gram lengths are 1-4, and the default jamo-level n-gram lengths are 3-5. As the number of distinct n-grams grows, you should increase the maximum number of unique n-grams (-bucket); otherwise different n-grams will be hashed into the same bucket and forced to share a vector. We recommend 10,000,000 for approximately 3GB of (parsed) Korean corpus.
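The collision risk comes from fastText's hashing trick: n-grams are not stored in an explicit vocabulary but hashed into a fixed number of rows set by -bucket. A sketch of the scheme, using the FNV-1a hash that the original fastText dictionary uses (illustrative, not this repository's exact code):

```python
# Illustrative sketch of fastText-style n-gram bucketing (FNV-1a hash,
# as in the original fastText dictionary; not this repo's exact code).
def fnv1a(s: str) -> int:
    h = 2166136261                              # FNV-1a 32-bit offset basis
    for b in s.encode("utf-8"):
        h = ((h ^ b) * 16777619) & 0xFFFFFFFF   # xor byte, multiply by prime
    return h

def bucket_id(ngram: str, nbuckets: int = 10_000_000) -> int:
    # Each n-gram's vector lives at row (hash % nbuckets) of one shared
    # matrix, so distinct n-grams that land on the same row share a vector.
    return fnv1a(ngram) % nbuckets
```

With a small -bucket relative to the number of distinct character- and jamo-level n-grams in the corpus, many unrelated n-grams collide on the same row, which degrades the subword vectors; hence the 10,000,000 recommendation above.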
Constructing Korean OOV word vectors
The trained output file [output_file_name].bin can be used to compute word vectors for OOVs. Provided you have a text file queries.txt containing decomposed Korean words for which you want to compute vectors, use the following command:
$ [fastText_executable_path] print-word-vectors model.bin < queries.txt
Note that queries.txt should contain decomposed Korean words, such as ㄱㅏㅇㅇㅏeㅈㅣe for 강아지. You can also use the jamo_split method in decompose_letters.py to obtain decomposed Korean words.
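For intuition, the OOV composition that print-word-vectors performs can be sketched as follows: fastText averages the vectors of a word's n-grams, extracted between "<" and ">" boundary markers. Only character-level n-grams over the decomposed string are shown here, and the embedding table and values are toy assumptions:

```python
# Toy sketch of fastText-style OOV composition: average the vectors of the
# word's known n-grams. The table below is made up for illustration.
import numpy as np

def char_ngrams(word: str, minn: int = 1, maxn: int = 4):
    # fastText wraps the word in boundary markers before extracting n-grams.
    w = "<" + word + ">"
    return [w[i:i + n] for n in range(minn, maxn + 1)
            for i in range(len(w) - n + 1)]

def oov_vector(word, ngram_vecs, dim):
    # The OOV word vector is the average of the vectors of its known n-grams.
    grams = [g for g in char_ngrams(word) if g in ngram_vecs]
    if not grams:
        return np.zeros(dim)
    return np.mean([ngram_vecs[g] for g in grams], axis=0)

# Toy table containing just two n-grams of the decomposed word "ㄱㅏㅇ".
table = {"<ㄱ": np.ones(4), "ㄱㅏ": 3 * np.ones(4)}
print(oov_vector("ㄱㅏㅇ", table, dim=4))  # -> [2. 2. 2. 2.]
```

This is why the query words must be decomposed first: the trained n-gram table is indexed by jamo sequences, not by precomposed syllables.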
Reference
Please cite the following if you use this code for learning word representations for Korean or evaluate word vectors using the evaluation sets.
@inproceedings{park-etal-2018-subword,
title = "Subword-level Word Vector Representations for {K}orean",
author = "Park, Sungjoon and
Byun, Jeongmin and
Baek, Sion and
Cho, Yongseok and
Oh, Alice",
booktitle = "Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
year = "2018",
pages = "2429--2438"
}
Change Log
01-11-19 : Add implementations. version 1.0
05-04-18 : Initial upload of datasets. version 1.0