PyKoSpacing
Python package for automatic Korean word spacing.
R version can be found here.
Introduction
Word spacing is an important preprocessing step in Korean text analysis, and its accuracy strongly affects the accuracy of subsequent analysis. PyKoSpacing provides fairly accurate automatic word spacing and works especially well on online text originating from SNS or SMS.
For example, "아버지가방에들어가신다." can be spaced in either of the two ways below.
- "아버지가 방에 들어가신다." means "My father enters the room."
- "아버지 가방에 들어가신다." means "My father goes into the bag."
By common sense, the first is the right answer.
PyKoSpacing is based on a deep learning model trained on a large corpus (more than 100 million news articles from Chan-Yub Park).
Performance
Test Set | Accuracy |
---|---|
Sejong(colloquial style) Corpus(1M) | 97.1% |
OOOO(literary style) Corpus(3M) | 94.3% |
- Accuracy = # correctly spaced characters/# characters in the test data.
- Performance might improve further if compound words are normalized.
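The accuracy definition above can be made concrete with a short sketch. It assumes one plausible reading of the metric: each non-space character is labeled by whether a space follows it, and accuracy is the fraction of characters whose label matches the reference. The function names are hypothetical and are not part of PyKoSpacing.

```python
def space_labels(sentence):
    """Return one label per non-space character: 1 if a space follows it, else 0."""
    labels = []
    for ch in sentence.strip():
        if ch == " ":
            if labels:
                labels[-1] = 1  # mark the preceding character as followed by a space
        else:
            labels.append(0)
    return labels


def spacing_accuracy(predicted, reference):
    """Fraction of characters whose spacing label matches the reference (assumed metric)."""
    pred, ref = space_labels(predicted), space_labels(reference)
    if len(pred) != len(ref):
        raise ValueError("sentences differ in their non-space characters")
    if not ref:
        return 1.0
    correct = sum(p == r for p, r in zip(pred, ref))
    return correct / len(ref)


# Two candidate spacings of the same sentence; their labels differ at two characters.
reference = "아버지가 방에 들어가신다."
predicted = "아버지 가방에 들어가신다."
print(spacing_accuracy(predicted, reference))
```

Under this reading, a single misplaced space costs two characters: the one that wrongly gained a space and the one that wrongly lost it.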
Install
PyPI Install
Pre-requisite:
proper installation of python3
proper installation of pip
pip install tensorflow
pip install keras
Windows-Ubuntu case: if the following error occurs, run the commands below.
/usr/lib/x86_64-linux-gnu/libstdc++.so.6: version `GLIBCXX_3.4.22' not found
sudo apt-get install libstdc++6
sudo add-apt-repository ppa:ubuntu-toolchain-r/test
sudo apt-get update
sudo apt-get upgrade
sudo apt-get dist-upgrade  # This takes a long time.
Darwin (M1) case: TensorFlow has to be installed in a different way, using Miniforge3.
# Install Miniforge3 for mac
curl -L -O https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-MacOSX-arm64.sh
chmod +x Miniforge3-MacOSX-arm64.sh
sh Miniforge3-MacOSX-arm64.sh
# Activate Miniforge3 virtualenv
# Use Python 3.10 or earlier.
source ~/miniforge3/bin/activate
# Install the Tensorflow dependencies
conda install -c apple tensorflow-deps
# Install base tensorflow
python -m pip install tensorflow-macos
# Install metal plugin
python -m pip install tensorflow-metal
To install from GitHub, use
pip install git+https://github.com/haven-jeon/PyKoSpacing.git
Example
>>> from pykospacing import Spacing
>>> spacing = Spacing()
>>> spacing("김형호영화시장분석가는'1987'의네이버영화정보네티즌10점평에서언급된단어들을지난해12월27일부터올해1월10일까지통계프로그램R과KoNLP패키지로텍스트마이닝하여분석했다.")
"김형호 영화시장 분석가는 '1987'의 네이버 영화 정보 네티즌 10점 평에서 언급된 단어들을 지난해 12월 27일부터 올해 1월 10일까지 통계 프로그램 R과 KoNLP 패키지로 텍스트마이닝하여 분석했다."
>>> # Apply a list of words that must be non-spacing
>>> spacing('귀밑에서턱까지잇따라난수염을구레나룻이라고한다.')
'귀 밑에서 턱까지 잇따라 난 수염을 구레나 룻이라고 한다.'
>>> spacing = Spacing(rules=['구레나룻'])
>>> spacing('귀밑에서턱까지잇따라난수염을구레나룻이라고한다.')
'귀 밑에서 턱까지 잇따라 난 수염을 구레나룻이라고 한다.'
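The effect of `rules` in the session above can be approximated with a small post-processing step. The sketch below is an assumption about how such a rule could be enforced, not a description of PyKoSpacing's internals: each rule word is compiled into a pattern that tolerates whitespace between its characters, and any match is collapsed back into the unspaced word. `compile_rules` and `apply_rules` are hypothetical names.

```python
import re


def compile_rules(words):
    """Map each rule word to a regex matching it with optional whitespace between characters."""
    return {
        word: re.compile(r"\s*".join(re.escape(ch) for ch in word))
        for word in words
        if word  # skip empty strings
    }


def apply_rules(spaced_sentence, rules):
    """Collapse any spaced-out occurrence of a rule word back into the word itself."""
    for word, pattern in rules.items():
        # A function replacement avoids backslash interpretation in the rule word.
        spaced_sentence = pattern.sub(lambda _m, w=word: w, spaced_sentence)
    return spaced_sentence


rules = compile_rules(["구레나룻"])
model_output = "귀 밑에서 턱까지 잇따라 난 수염을 구레나 룻이라고 한다."
print(apply_rules(model_output, rules))
```

Applying the rules after the model has run keeps the sketch independent of the model itself, at the cost of one regex pass per rule word.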
Setting rules with a CSV file (you only need to call the set_rules_by_csv() method).
$ cat test.csv
인덱스,단어
1,네이버영화
2,언급된단어
>>> from pykospacing import Spacing
>>> spacing = Spacing(rules=[''])
>>> spacing.set_rules_by_csv('./test.csv', '단어')
>>> spacing("김형호영화시장분석가는'1987'의네이버영화정보네티즌10점평에서언급된단어들을지난해12월27일부터올해1월10일까지통계프로그램R과KoNLP패키지로텍스트마이닝하여분석했다.")
"김형호 영화시장 분석가는 '1987'의 네이버영화 정보 네티즌 10점 평에서 언급된단어들을 지난해 12월 27일부터 올해 1월 10일까지 통계 프로그램 R과 KoNLP 패키지로 텍스트마이닝하여 분석했다."
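Rule words can also be read from a CSV column by hand with the standard library, which may help when the list needs filtering before use. This is a sketch under assumptions: `load_rule_words` is a hypothetical helper, not part of PyKoSpacing, and the file is assumed to be UTF-8 with a header row.

```python
import csv


def load_rule_words(path, column):
    """Return the non-empty values of one named column from a CSV file with a header row."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        return [
            row[column].strip()
            for row in reader
            if (row.get(column) or "").strip()  # skip missing or blank cells
        ]


# Hypothetical usage with a file shaped like test.csv:
# words = load_rule_words("./test.csv", "단어")
```

The resulting list could then be passed to the constructor as `Spacing(rules=words)`.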
Run on the command line (thanks to lqez).
$ cat test_in.txt
김형호영화시장분석가는'1987'의네이버영화정보네티즌10점평에서언급된단어들을지난해12월27일부터올해1월10일까지통계프로그램R과KoNLP패키지로텍스트마이닝하여분석했다.
아버지가방에들어가신다.
$ python -m pykospacing.pykos test_in.txt
김형호 영화시장 분석가는 '1987'의 네이버 영화 정보 네티즌 10점 평에서 언급된 단어들을 지난해 12월 27일부터 올해 1월 10일까지 통계 프로그램 R과 KoNLP 패키지로 텍스트마이닝하여 분석했다.
아버지가 방에 들어가신다.
Model Architecture
For Training
- Training code uses an architecture that is more advanced than PyKoSpacing, but also contains the learning logic of PyKoSpacing.
Citation
@misc{heewon2018,
author = {Heewon Jeon},
title = {KoSpacing: Automatic Korean word spacing},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/haven-jeon/KoSpacing}}