hangul-utils
An integrated library for Korean language preprocessing.
This simple library has the following features.
- Text Normalization: typo and error correction using open-korean-text.
- Tokenization: sentence and word-level tokenization using Mecab-ko.
- Character Manipulation: splitting and combining of jamo characters (our own implementation)
Python 2 is no longer supported.
Getting Started
For text normalization, Open Korean Text is required. For tokenization, Mecab-ko is required.
First, run install_mecab_ko.sh with sudo to install Mecab-ko system-wide.
sudo bash install_mecab_ko.sh
Note that LD_LIBRARY_PATH must be set to point to /usr/local/lib:/usr/lib.
The script above will set that for you temporarily, but you must set it yourself after a restart.
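As a quick sanity check after a restart, the sketch below reports which of the two directories are missing from LD_LIBRARY_PATH. The helper name missing_library_dirs is hypothetical and not part of hangul-utils; it only inspects the variable's value and does not verify that Mecab-ko is actually installed there.

```python
import os

# Directories that the note above says LD_LIBRARY_PATH must include.
REQUIRED_DIRS = ("/usr/local/lib", "/usr/lib")


def missing_library_dirs(value, required=REQUIRED_DIRS):
    """Return the required directories absent from a colon-separated path list."""
    # Normalize entries so that "/usr/lib/" and "/usr/lib" compare equal.
    present = {os.path.normpath(p) for p in value.split(":") if p}
    return [d for d in required if os.path.normpath(d) not in present]


if __name__ == "__main__":
    missing = missing_library_dirs(os.environ.get("LD_LIBRARY_PATH", ""))
    print("missing:", missing if missing else "none")
```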
Then install the Open Korean Text Python wrapper by running
bash install_twkorean.sh
Sudo is not required for this one.
Finally, install the hangul-utils package by cloning this repo and running
# install from source
python setup.py install
Optionally, you could install the package from PyPI, but this is not recommended, because some of the required packages install properly only when installed from their git repositories.
Text Normalization
Text normalization is necessary for reducing noise in text collected online or transcribed from spoken language. To date, the only open-source library that tackles the problem is Open Korean Text.
Open Korean Text's normalization function is not meant to handle all error cases. The source code indicates that the normalization process focuses largely on fixing typing errors and much less on linguistic errors. The entire process consists of the following steps:
- Removal of repeating jamos that have accidentally become the final (e.g. -γ ) of the preceding character: μ΄γ γ γ γ -> μγ γ γ γ
- Shortening of jamos that repeat more than twice: γ γ γ γ γ γ γ ->γ γ
- Shortening of characters that repeat more than twice: νμ©νμ©νμ© -> νμ©νμ©
- Normalization of frequently shortened words: this mainly includes three cases (-γ΄λ°, -γ΄μ§, -γ΄κ°)
- Normalization of tense jamos: γ ->γ , νλλ° -> νλλ°
- Space normalization: \n\n\n -> \n
Any errors or variations beyond the above cases will not be normalized.
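To make the repetition and whitespace rules above concrete, here is a minimal regex sketch of three of them. It is not Open Korean Text's implementation: the function name shorten_repeats and the exact patterns are assumptions chosen for illustration, and the rules for misplaced finals, shortened words, and tense jamos are not covered.

```python
import re

# Compatibility jamo (consonants and vowels, U+3131..U+3163) repeated 3+ times.
JAMO_RUN = re.compile(r"([\u3131-\u3163])\1{2,}")
# Shortest run of Hangul syllables (U+AC00..U+D7A3) repeated 3+ times in a row.
SYLLABLE_RUN = re.compile(r"([\uAC00-\uD7A3]+?)\1{2,}")
# Two or more consecutive newlines (assumed threshold; the list only shows three).
NEWLINE_RUN = re.compile(r"\n{2,}")


def shorten_repeats(text):
    """Cut jamo and syllable repetitions down to two, and newline runs to one."""
    text = JAMO_RUN.sub(r"\1\1", text)
    text = SYLLABLE_RUN.sub(r"\1\1", text)
    return NEWLINE_RUN.sub("\n", text)
```

The lazy quantifier in SYLLABLE_RUN makes the pattern prefer the shortest repeating unit, so a two-syllable unit repeated three times is cut down to two repetitions of that unit.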
Usages
Normalization is available as a Preprocessor method or as a separate function.
>>> from hangul_utils import Preprocessor
>>> p = Preprocessor()
>>> p.normalize("λΆλ€λΆλ€λΆλ€λΆλ€ λ΄κ° μκ°λ° νκ°λ°γ
γ
γ
γ
")
"λΆλ€λΆλ€ λ΄κ° μκ°μΈλ° νκ°λγ
γ
γ
"
>>> from hangul_utils import normalize
>>> normalize("λΆλ€λΆλ€λΆλ€λΆλ€ λ΄κ° μκ°λ° νκ°λ°γ
γ
γ
γ
")
"λΆλ€λΆλ€ λ΄κ° μκ°μΈλ° νκ°λγ
γ
γ
"
Tokenizations
Sentence and word tokenization methods are available in this library, with mecab-ko as the backend. Twitter's Korean text library also provides part-of-speech tokenization, but it generally lacks the robustness offered by other taggers. For example, a common grammatical mistake is failing to separate "할" and "수" when describing the ability to do something. Twitter's tagger fails to tokenize correctly in such cases:
>>> twitter_tokenize("μκ°κ° ν μ μλμΌμ΄ μμ§.")
[('μκ°', 'Noun'),
('κ°', 'Josa'),
('ν μ', 'Verb'),
('μλ', 'Adjective'),
('μΌμ΄', 'Noun'),
('μμ§', 'Adjective'),
('.', 'Punctuation')]
However, Mecab produces better results for the same sentence because it ignores spaces during part-of-speech analysis:
>>> mecab_tokenize("μκ°κ° ν μ μλμΌμ΄ μμ§.")
[('μκ°', 'NNG'),
('κ°', 'JKS'),
('ν ', 'VV+ETM'),
('μ', 'NNB'),
('μ', 'VV'),
('λ', 'ETM'),
('μΌ', 'NNG'),
('μ΄', 'JKS'),
('μ', 'VA'),
('μ§', 'EF'),
('.', 'SF')]
Additionally, Mecab is known to be much faster.
Mecab also supports the SF part-of-speech tag to indicate the end of a sentence (on which our sentence tokenizer is based), unlike Twitter's tagger, which assigns all punctuation and special symbols to a more general Punctuation tag.
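The idea of basing sentence boundaries on the SF tag can be sketched as follows. This is only an illustration and not necessarily how hangul-utils implements it: the function name split_sentences is hypothetical, the residual flag is modeled on the residual argument of sent_tokenize, and the sample tokens are hand-written placeholders rather than real Mecab output.

```python
def split_sentences(tagged, residual=True):
    """Group (morpheme, tag) pairs into sentences, closing one at each SF tag."""
    sentences, current = [], []
    for morpheme, tag in tagged:
        current.append(morpheme)
        if tag == "SF":
            # Sentence-final punctuation ends the current sentence.
            sentences.append(current)
            current = []
    # Tokens after the last SF form an incomplete sentence; keep it only on request.
    if current and residual:
        sentences.append(current)
    return sentences


# Hand-written placeholder tokens, not real Mecab output.
tagged = [("A", "NNG"), (".", "SF"), ("B", "NNG"), ("?", "SF"), ("C", "NNG")]
complete_only = split_sentences(tagged, residual=False)
```

A single pass over the tagged tokens is enough here, which is one plausible reason a combined sentence-and-morpheme tokenizer can be cheaper than running the two steps separately.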
Overall, we found Mecab to be much more productive in most of our cases. However, other taggers could well be more suitable for other cases; we do not claim that our experience will be the same for others.
Usages
Sentence tokenization:
>>> from hangul_utils import sent_tokenize
>>> list(sent_tokenize("κ·Έλ¬λ λ² λ€μμλΌλ λ
μΌ λ³΄λ€ ν λ¨κ³ μμλ€. ν μμ μμ λμ λλ μ μκ° λͺλͺ μλ€."))
['κ·Έλ¬λ λ² λ€μμλΌλ λ
μΌ λ³΄λ€ ν λ¨κ³ μμλ€.', 'ν μμ μμ λμ λλ μ μκ° λͺλͺ μλ€.']
Word tokenization (mainly using space):
>>> from hangul_utils import word_tokenize
>>> list(word_tokenize("κ·Έλ¬λ λ² λ€μμλΌλ λ
μΌ λ³΄λ€ ν λ¨κ³ μμλ€."))
['κ·Έλ¬λ', 'λ² λ€μμλΌλ', 'λ
μΌ', '보λ€', 'ν', 'λ¨κ³', 'μμλ€', '.']
Morpheme tokenization:
>>> from hangul_utils import morph_tokenize
>>> list(morph_tokenize("κ·Έλ¬λ λ² λ€μμλΌλ λ
μΌ λ³΄λ€ ν λ¨κ³ μμλ€."))
['κ·Έλ¬λ', 'λ² λ€μμλΌ', 'λ', 'λ
μΌ', '보λ€', 'ν', 'λ¨κ³', 'μ', 'μ', 'λ€', '.']
Morpheme tokenization with POS:
>>> from hangul_utils import morph_tokenize
>>> list(morph_tokenize("ν μμ μμ λμ λλ μ μκ° λͺλͺ μλ€.", pos=True))
[('ν', 'MM'),
('μμ ', 'NNG'),
('μμ', 'JKB'),
('λ', 'NNG'),
('μ', 'JKB'),
('λ', 'VV'),
('λ', 'ETM'),
('μ μ', 'NNG'),
('κ°', 'JKS'),
('λͺλͺ', 'MM'),
('μ', 'VA'),
('λ€', 'EF'),
('.', 'SF')]
Simultaneous sentence and word tokenization (more efficient than calling each of them in succession):
>>> from hangul_utils import sent_word_tokenize
>>> list(sent_word_tokenize("κ·Έλ¬λ λ² λ€μμλΌλ λ
μΌ λ³΄λ€ ν λ¨κ³ μμλ€. ν μμ μμ λμ λλ μ μκ° λͺλͺ μλ€."))
[['κ·Έλ¬λ', 'λ² λ€μμλΌλ', 'λ
μΌ', '보λ€', 'ν', 'λ¨κ³', 'μμλ€', '.'],
['ν', 'μμ μμ', 'λμ', 'λλ', 'μ μκ°', 'λͺλͺ', 'μλ€', '.']]
Simultaneous sentence and morpheme tokenization (more efficient than calling each of them in succession):
>>> from hangul_utils import sent_morph_tokenize
>>> list(sent_morph_tokenize("κ·Έλ¬λ λ² λ€μμλΌλ λ
μΌ λ³΄λ€ ν λ¨κ³ μμλ€. ν μμ μμ λμ λλ μ μκ° λͺλͺ μλ€."))
[['κ·Έλ¬λ', 'λ² λ€μμλΌ', 'λ', 'λ
μΌ', '보λ€', 'ν', 'λ¨κ³', 'μ', 'μ', 'λ€', '.'],
['ν', 'μμ ', 'μμ', 'λ', 'μ', 'λ', 'λ', 'μ μ', 'κ°', 'λͺλͺ', 'μ', 'λ€', '.']]
Omission of incomplete sentences:
>>> from hangul_utils import sent_tokenize
>>> list(sent_tokenize("κ·Έλ¬λ λ² λ€μμλΌλ λ
μΌ λ³΄λ€ ν λ¨κ³ μμλ€. ν μμ μμ λμ λλ μ μκ°", residual=False))
['κ·Έλ¬λ λ² λ€μμλΌλ λ
μΌ λ³΄λ€ ν λ¨κ³ μμλ€.']
Again, these functions are also available as methods of Preprocessor.
Manipulating Korean Characters
Hangul syllables are composed of basic letters called 'jamo' (자모). As such, a tool for splitting and joining jamos comes in very handy for character-level Korean text processing. Splitting a Korean character is quite straightforward: a simple algebraic formula can deduce the Unicode value of each individual jamo (Wiki). The tricky part is forming a string of Hangul syllables from a string of jamos: because consonants can be either initials or finals, some form of backtracking is required.
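The arithmetic behind splitting, and the ambiguity that makes joining harder, can be sketched as below. This is an illustrative sketch based on the Unicode layout of precomposed Hangul syllables, not the library's own code: the names decompose, split_text, and join_text are hypothetical, and the join uses a simple one-character lookahead rather than full backtracking.

```python
# Compatibility jamo tables in Unicode syllable-composition order.
INITIALS = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"  # 19 initial consonants
MEDIALS = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"  # 21 vowels
FINALS = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")  # none + 27
BASE = 0xAC00  # code point of the first precomposed syllable
COUNT = 19 * 21 * 28  # number of precomposed syllables


def decompose(ch):
    """Split one syllable into (initial, medial, final); final is '' if absent."""
    index = ord(ch) - BASE
    if not 0 <= index < COUNT:
        return (ch,)  # not a precomposed syllable: pass through unchanged
    initial, rest = divmod(index, 21 * 28)
    medial, final = divmod(rest, 28)
    return (INITIALS[initial], MEDIALS[medial], FINALS[final])


def split_text(text):
    """Convert a string of syllables to a flat string of jamos."""
    return "".join("".join(decompose(ch)) for ch in text)


def join_text(jamos):
    """Recombine jamos, treating a consonant as a final only if no vowel follows."""
    out, i, n = [], 0, len(jamos)
    while i < n:
        if jamos[i] in INITIALS and i + 1 < n and jamos[i + 1] in MEDIALS:
            initial = INITIALS.index(jamos[i])
            medial = MEDIALS.index(jamos[i + 1])
            final = 0
            i += 2
            # Lookahead: if a vowel follows, the consonant starts the next syllable.
            next_is_vowel = i + 1 < n and jamos[i + 1] in MEDIALS
            if i < n and jamos[i] in FINALS and not next_is_vowel:
                final = FINALS.index(jamos[i])
                i += 1
            out.append(chr(BASE + (initial * 21 + medial) * 28 + final))
        else:
            out.append(jamos[i])  # stray jamo or non-Hangul character
            i += 1
    return "".join(out)
```

Text that mixes complete syllables with stray consonant jamos is inherently ambiguous once flattened, so this sketch is not guaranteed to round-trip such input.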
Functions
- split_syllables: converts a string of syllables to a string of jamos
- join_jamos: converts a string of jamos to a string of syllables
Usages
>>> from hangul_utils import split_syllable_char, split_syllables, join_jamos
>>> print(split_syllable_char(u"μ"))
('γ
', 'γ
', 'γ΄')
>>> print(split_syllables(u"μλ
νμΈμ"))
γ
γ
γ΄γ΄γ
γ
γ
γ
γ
γ
γ
γ
>>> sentence = u"μ μ§ ν₯μ£½μ λΆμ ν₯ νν₯μ£½μ΄κ³ , λ·μ§ 콩죽μ ν콩 λ¨μ½© 콩죽.μ°λ¦¬ μ§
κΉ¨μ£½μ κ²μ κΉ¨ κΉ¨μ£½μΈλ° μ¬λλ€μ ν콩 λ¨μ½© 콩죽 κΉ¨μ£½ μ£½λ¨ΉκΈ°λ₯Ό μ«μ΄νλλΌ."
>>> s = split_syllables(sentence)
>>> print(s)
γ
γ
γ
γ
γ
£γ
γ
γ
γ
γ
γ
γ±γ
γ
‘γ΄ γ
γ
γΊγ
γ
‘γ΄ γ
γ
γ
γ
γ
γ
γ
γ
γ
γ
γ
γ±γ
γ
£γ±γ
,
γ·γ
γ
γ
γ
£γ
γ
γ
γ
γ
γ
γ±γ
γ
‘γ΄ γ
γ
γ
γ
γ
γ
γ·γ
γ΄γ
γ
γ
γ
γ
γ
γ
γ
γ±.γ
γ
γΉγ
£
γ
γ
£γ
γ²γ
γ
γ
γ±γ
γ
‘γ΄ γ±γ
γ
γ
γ
‘γ΄ γ²γ
γ²γ
γ
γ
γ±γ
γ
£γ΄γ·γ
γ
γ
γΉγ
γ
γ·γ
‘γΉγ
γ
‘γ΄
γ
γ
γ
γ
γ
γ
γ·γ
γ΄γ
γ
γ
γ
γ
γ
γ
γ
γ± γ²γ
γ
γ
γ± γ
γ
γ±γ
γ
γ±γ±γ
£γΉγ
‘γΉ
γ
γ
£γ
γ
γ
γ
γ
γ·γ
γΉγ
.
>>> sentence2 = join_jamos(s)
>>> print(sentence2)
μ μ§ ν₯μ£½μ λΆμ ν₯ νν₯μ£½μ΄κ³ , λ·μ§ 콩죽μ ν콩 λ¨μ½© 콩죽.μ°λ¦¬ μ§ κΉ¨μ£½μ κ²μ κΉ¨
κΉ¨μ£½μΈλ° μ¬λλ€μ ν콩 λ¨μ½© 콩죽 κΉ¨μ£½ μ£½λ¨ΉκΈ°λ₯Ό μ«μ΄νλλΌ.
>>> print(sentence == sentence2)
True