hangul-utils

An integrated library for Korean language preprocessing.

This simple library has the following features:

  • Text Normalization: typo and error correction using open-korean-text.
  • Tokenization: sentence- and word-level tokenization using Mecab-ko.
  • Character Manipulation: splitting and combining of jamo characters (our own implementation).

Python 2 is no longer supported.

Getting Started

For text normalization, Open Korean Text is required. For tokenization, Mecab-ko is required. First, run install_mecab_ko.sh with sudo to install Mecab-ko system-wide:

sudo bash install_mecab_ko.sh

Note that LD_LIBRARY_PATH must be set to point to /usr/local/lib:/usr/lib.

The script above will set that for you temporarily, but you must set it yourself after a restart.

Then, install the Open Korean Text Python wrapper by running:

bash install_twkorean.sh

Sudo is not required for this one.

Finally, install the hangul-utils package by cloning this repo and running:

# install from source
python setup.py install

Alternatively, you could install the package from PyPI, but this is not recommended, as some of the required packages install properly only when installed from their git repositories.

Text Normalization

Text normalization is necessary for reducing noise in text collected online or transcribed from spoken language. To date, the only open-source library that tackles this problem is Open Korean Text.

Open Korean Text's normalization function is not meant to deal with all error cases. The source code indicates that the normalization process largely focuses on fixing typing errors and deals less with linguistic errors themselves. The entire process consists of the following procedures:

  • Removal of repeating jamos that have accidentally become the final (e.g. -γ…‹) of the preceding character: μ™΄γ…‹γ…‹γ…‹γ…‹ -> μ™œγ…‹γ…‹γ…‹γ…‹
  • Shortening of jamos that repeat more than twice: γ…‹γ…‹γ…‹γ…‹γ…‹γ…‹γ…‹ -> γ…‹γ…‹
  • Shortening of characters that repeat more than twice: ν›Œμ©ν›Œμ©ν›Œμ© -> ν›Œμ©ν›Œμ©
  • Normalization of frequently shortened words: this mainly includes three cases (-ㄴ데, -ㄴ지, -γ„΄κ°€)
  • Normalization of tense jamos: γ…… -> γ…†, e.g. ν–‡λŠ”λ° -> ν–ˆλŠ”λ°
  • Space normalization: \n\n\n -> \n

Any errors or variations beyond the above cases will not be normalized.

Usage

Normalization is available as a Preprocessor method or as a separate function.

>>> from hangul_utils import Preprocessor
>>> p = Preprocessor()
>>> p.normalize("λΆ€λ“€λΆ€λ“€λΆ€λ“€λΆ€λ“€ λ‚΄κ°€ μž‘κ°„λ° ν™”κ°€λ‚°γ…‹γ…‹γ…‹γ…‹")
"λΆ€λ“€λΆ€λ“€ λ‚΄κ°€ μž‘κ°€μΈλ° ν™”κ°€λ‚˜γ…‹γ…‹γ…‹"

>>> from hangul_utils import normalize
>>> normalize("λΆ€λ“€λΆ€λ“€λΆ€λ“€λΆ€λ“€ λ‚΄κ°€ μž‘κ°„λ° ν™”κ°€λ‚°γ…‹γ…‹γ…‹γ…‹")
"λΆ€λ“€λΆ€λ“€ λ‚΄κ°€ μž‘κ°€μΈλ° ν™”κ°€λ‚˜γ…‹γ…‹γ…‹"

Tokenization

Sentence and word tokenization methods are available in this library, with Mecab-ko as the backend. Twitter's Korean text library also provides part-of-speech tokenization methods, but it generally lacks the robustness offered by other taggers. For example, a common grammatical mistake is not separating "ν• " and "수" when describing the ability to do something. Twitter's tagger fails to tokenize such cases correctly:

>>> twitter_tokenize("μž‘κ°€κ°€ ν• μˆ˜ μžˆλŠ”μΌμ΄ μžˆμ§€.")
[('μž‘κ°€', 'Noun'),
 ('κ°€', 'Josa'),
 ('ν• μˆ˜', 'Verb'),
 ('μžˆλŠ”', 'Adjective'),
 ('일이', 'Noun'),
 ('μžˆμ§€', 'Adjective'),
 ('.', 'Punctuation')]

However, Mecab produces better results for the same sentence because it ignores spaces during part-of-speech analysis:

>>> mecab_tokenize("μž‘κ°€κ°€ ν• μˆ˜ μžˆλŠ”μΌμ΄ μžˆμ§€.")
[('μž‘κ°€', 'NNG'),
 ('κ°€', 'JKS'),
 ('ν• ', 'VV+ETM'),
 ('수', 'NNB'),
 ('있', 'VV'),
 ('λŠ”', 'ETM'),
 ('일', 'NNG'),
 ('이', 'JKS'),
 ('있', 'VA'),
 ('지', 'EF'),
 ('.', 'SF')]

Additionally, Mecab is known to be much faster.

Mecab also provides the SF part-of-speech tag to mark the end of a sentence (which our sentence tokenizer relies on), unlike Twitter's tagger, which lumps all punctuation and special symbols into a more general Punctuation tag.
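
To illustrate how an SF-based splitter can work (a minimal sketch, not the library's actual implementation), one can accumulate tagged morphemes and emit a sentence whenever an SF tag appears:

# Sketch only: group (morpheme, POS) pairs into sentences at SF tags.
def split_at_sf(tagged_morphs):
    sentence = []
    for morph, pos in tagged_morphs:
        sentence.append(morph)
        if pos == "SF":      # sentence-final punctuation
            yield sentence
            sentence = []
    if sentence:             # leftover morphemes with no closing SF
        yield sentence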

Overall, we have found Mecab to be much more productive in most of our cases. However, it is entirely possible that other taggers are more suitable for other use cases; we do not claim that our experience will generalize to everyone.

Usage

Sentence tokenization:

>>> from hangul_utils import sent_tokenize
>>> list(sent_tokenize("κ·ΈλŸ¬λ‚˜ λ² λ„€μˆ˜μ—˜λΌλŠ” 독일 보닀 ν•œ 단계 μœ„μ˜€λ‹€. ν˜„ μ‹œμ μ—μ„œ λˆˆμ— λ„λŠ” μ„ μˆ˜κ°€ λͺ‡λͺ‡ μžˆλ‹€."))
['κ·ΈλŸ¬λ‚˜ λ² λ„€μˆ˜μ—˜λΌλŠ” 독일 보닀 ν•œ 단계 μœ„μ˜€λ‹€.', 'ν˜„ μ‹œμ μ—μ„œ λˆˆμ— λ„λŠ” μ„ μˆ˜κ°€ λͺ‡λͺ‡ μžˆλ‹€.']

Word tokenization (mainly using space):

>>> from hangul_utils import word_tokenize
>>> list(word_tokenize("κ·ΈλŸ¬λ‚˜ λ² λ„€μˆ˜μ—˜λΌλŠ” 독일 보닀 ν•œ 단계 μœ„μ˜€λ‹€."))
['κ·ΈλŸ¬λ‚˜', 'λ² λ„€μˆ˜μ—˜λΌλŠ”', '독일', '보닀', 'ν•œ', '단계', 'μœ„μ˜€λ‹€', '.']

Morpheme tokenization:

>>> from hangul_utils import morph_tokenize
>>> list(morph_tokenize("κ·ΈλŸ¬λ‚˜ λ² λ„€μˆ˜μ—˜λΌλŠ” 독일 보닀 ν•œ 단계 μœ„μ˜€λ‹€."))
['κ·ΈλŸ¬λ‚˜', 'λ² λ„€μˆ˜μ—˜λΌ', 'λŠ”', '독일', '보닀', 'ν•œ', '단계', 'μœ„', 'μ˜€', 'λ‹€', '.']

Morpheme tokenization with POS:

>>> from hangul_utils import morph_tokenize
>>> list(morph_tokenize("ν˜„ μ‹œμ μ—μ„œ λˆˆμ— λ„λŠ” μ„ μˆ˜κ°€ λͺ‡λͺ‡ μžˆλ‹€.", pos=True))
[('ν˜„', 'MM'),
 ('μ‹œμ ', 'NNG'),
 ('μ—μ„œ', 'JKB'),
 ('눈', 'NNG'),
 ('에', 'JKB'),
 ('띄', 'VV'),
 ('λŠ”', 'ETM'),
 ('μ„ μˆ˜', 'NNG'),
 ('κ°€', 'JKS'),
 ('λͺ‡λͺ‡', 'MM'),
 ('있', 'VA'),
 ('λ‹€', 'EF'),
 ('.', 'SF')]

Simultaneous sentence and word tokenization (more efficient than calling each of them in succession):

>>> from hangul_utils import sent_word_tokenize
>>> list(sent_word_tokenize("κ·ΈλŸ¬λ‚˜ λ² λ„€μˆ˜μ—˜λΌλŠ” 독일 보닀 ν•œ 단계 μœ„μ˜€λ‹€. ν˜„ μ‹œμ μ—μ„œ λˆˆμ— λ„λŠ” μ„ μˆ˜κ°€ λͺ‡λͺ‡ μžˆλ‹€."))
[['κ·ΈλŸ¬λ‚˜', 'λ² λ„€μˆ˜μ—˜λΌλŠ”', '독일', '보닀', 'ν•œ', '단계', 'μœ„μ˜€λ‹€', '.'],
['ν˜„', 'μ‹œμ μ—μ„œ', 'λˆˆμ—', 'λ„λŠ”', 'μ„ μˆ˜κ°€', 'λͺ‡λͺ‡', 'μžˆλ‹€', '.']]
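
For comparison, the same nested result could be produced, less efficiently, by composing the two functions shown earlier; the combined call presumably saves the extra pass over the text:

>>> text = "κ·ΈλŸ¬λ‚˜ λ² λ„€μˆ˜μ—˜λΌλŠ” 독일 보닀 ν•œ 단계 μœ„μ˜€λ‹€. ν˜„ μ‹œμ μ—μ„œ λˆˆμ— λ„λŠ” μ„ μˆ˜κ°€ λͺ‡λͺ‡ μžˆλ‹€."
>>> nested = [list(word_tokenize(s)) for s in sent_tokenize(text)]  # same sentences, two passes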

Simultaneous sentence and morpheme tokenization (more efficient than calling each of them in succession):

>>> from hangul_utils import sent_morph_tokenize
>>> list(sent_morph_tokenize("κ·ΈλŸ¬λ‚˜ λ² λ„€μˆ˜μ—˜λΌλŠ” 독일 보닀 ν•œ 단계 μœ„μ˜€λ‹€. ν˜„ μ‹œμ μ—μ„œ λˆˆμ— λ„λŠ” μ„ μˆ˜κ°€ λͺ‡λͺ‡ μžˆλ‹€."))
[['κ·ΈλŸ¬λ‚˜', 'λ² λ„€μˆ˜μ—˜λΌ', 'λŠ”', '독일', '보닀', 'ν•œ', '단계', 'μœ„', 'μ˜€', 'λ‹€', '.'],
['ν˜„', 'μ‹œμ ', 'μ—μ„œ', '눈', '에', '띄', 'λŠ”', 'μ„ μˆ˜', 'κ°€', 'λͺ‡λͺ‡', '있', 'λ‹€', '.']]

Omission of incomplete sentences:

>>> from hangul_utils import sent_tokenize
>>> list(sent_tokenize("κ·ΈλŸ¬λ‚˜ λ² λ„€μˆ˜μ—˜λΌλŠ” 독일 보닀 ν•œ 단계 μœ„μ˜€λ‹€. ν˜„ μ‹œμ μ—μ„œ λˆˆμ— λ„λŠ” μ„ μˆ˜κ°€", residual=False))
['κ·ΈλŸ¬λ‚˜ λ² λ„€μˆ˜μ—˜λΌλŠ” 독일 보닀 ν•œ 단계 μœ„μ˜€λ‹€.']

Again, these functions are also available as methods of Preprocessor.
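
A minimal sketch, assuming the Preprocessor method names mirror the module-level functions (as normalize does above):

>>> from hangul_utils import Preprocessor
>>> p = Preprocessor()
>>> sents = list(p.sent_tokenize("κ·ΈλŸ¬λ‚˜ λ² λ„€μˆ˜μ—˜λΌλŠ” 독일 보닀 ν•œ 단계 μœ„μ˜€λ‹€."))  # assumed method name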

Manipulating Korean Characters

Hangul syllables are composed of basic letters called jamo (자λͺ¨). As such, a tool for splitting and joining jamo comes in handy for character-level Korean text processing. Splitting a Korean syllable is quite straightforward: a simple algebraic formula deduces the Unicode value of each individual jamo (Wiki). The tricky part is forming a string of Hangul syllables from a string of jamo: because a consonant can be either an initial or a final, some form of backtracking is required.
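
As a rough sketch of that arithmetic (for illustration only, not the library's own implementation), a precomposed syllable in the Unicode range U+AC00..U+D7A3 decomposes as follows:

# Illustration only, not hangul-utils' internal code.
# Precomposed syllables are laid out as:
#   code point = 0xAC00 + initial*588 + medial*28 + final
def decompose_syllable(ch):
    offset = ord(ch) - 0xAC00
    initial = offset // 588        # index among the 19 initial consonants
    medial = (offset % 588) // 28  # index among the 21 vowels
    final = offset % 28            # index among the finals; 0 means no final
    return initial, medial, final

# decompose_syllable("μ•ˆ") -> (11, 0, 4), i.e. γ…‡ + ㅏ + γ„΄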

Functions

  • split_syllables: converts a string of syllables to a string of jamos
  • join_jamos: converts a string of jamos to a string of syllables

Usage

>>> from hangul_utils import split_syllable_char, split_syllables, join_jamos
>>> print(split_syllable_char(u"μ•ˆ"))
('γ…‡', 'ㅏ', 'γ„΄')

>>> print(split_syllables(u"μ•ˆλ…•ν•˜μ„Έμš”"))
γ…‡γ…γ„΄γ„΄γ…•γ…‡γ…Žγ…γ……γ…”γ…‡γ…›

>>> sentence = u"μ•ž 집 νŒ₯죽은 뢉은 νŒ₯ ν’‹νŒ₯죽이고, 뒷집 콩죽은 햇콩 단콩 콩죽.우리 집
    깨죽은 검은 κΉ¨ 깨죽인데 μ‚¬λžŒλ“€μ€ 햇콩 단콩 콩죽 κΉ¨μ£½ μ£½λ¨ΉκΈ°λ₯Ό μ‹«μ–΄ν•˜λ”λΌ."
>>> s = split_syllables(sentence)
>>> print(s)
ㅇㅏㅍ γ…ˆγ…£γ…‚ γ…γ…γ…Œγ…ˆγ…œγ„±γ…‡γ…‘γ„΄ γ…‚γ…œγ„Ίγ…‡γ…‘γ„΄ γ…γ…γ…Œ γ…γ…œγ……γ…γ…γ…Œγ…ˆγ…œγ„±γ…‡γ…£γ„±γ…—,
γ„·γ…Ÿγ……γ…ˆγ…£γ…‚ γ…‹γ…—γ…‡γ…ˆγ…œγ„±γ…‡γ…‘γ„΄ γ…Žγ…γ……γ…‹γ…—γ…‡ ㄷㅏㄴㅋㅗㅇ γ…‹γ…—γ…‡γ…ˆγ…œγ„±.γ…‡γ…œγ„Ήγ…£
γ…ˆγ…£γ…‚ γ„²γ…γ…ˆγ…œγ„±γ…‡γ…‘γ„΄ ㄱㅓㅁㅇㅑㄴ ㄲㅐ γ„²γ…γ…ˆγ…œγ„±γ…‡γ…£γ„΄γ„·γ…” ㅅㅏㄹㅏㅁㄷㅑㄹㅇㅑㄴ
γ…Žγ…γ……γ…‹γ…—γ…‡ ㄷㅏㄴㅋㅗㅇ γ…‹γ…—γ…‡γ…ˆγ…œγ„± γ„²γ…γ…ˆγ…œγ„± γ…ˆγ…œγ„±γ…γ…“γ„±γ„±γ…£γ„Ήγ…‘γ„Ή
γ……γ…£γ…€γ…‡γ…“γ…Žγ…γ„·γ…“γ„Ήγ….

>>> sentence2 = join_jamos(s)
>>> print(sentence2)
μ•ž 집 νŒ₯죽은 뢉은 νŒ₯ ν’‹νŒ₯죽이고, 뒷집 콩죽은 햇콩 단콩 콩죽.우리 집 깨죽은 검은 κΉ¨
깨죽인데 μ‚¬λžŒλ“€μ€ 햇콩 단콩 콩죽 κΉ¨μ£½ μ£½λ¨ΉκΈ°λ₯Ό μ‹«μ–΄ν•˜λ”λΌ.

>>> print(sentence == sentence2)
True
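
Judging from the example above, characters outside the Hangul syllable block (spaces, punctuation, Latin letters) appear to pass through both functions unchanged:

>>> split_syllables(u"ν•œκΈ€ ABC!")  # expected, given the behavior above: 'γ…Žγ…γ„΄γ„±γ…‘γ„Ή ABC!'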