Repository Details

Automatic Korean word spacing with Python

PyKoSpacing

Python package for automatic Korean word spacing.

The R version, KoSpacing, is available at https://github.com/haven-jeon/KoSpacing.

License: GPL v3

Introduction

Word spacing is an important preprocessing step in Korean text analysis, and its accuracy strongly affects the accuracy of any subsequent analysis. PyKoSpacing provides fairly accurate automatic word spacing and works especially well on online text from social media (SNS) or SMS.

For example, "아버지가방에들어가신다." can be spaced in either of the following ways:

  1. "아버지가 방에 들어가신다." means "My father enters the room."
  2. "아버지 가방에 들어가신다." means "My father goes into the bag."

Common sense says the first is the right answer.

PyKoSpacing is based on a deep learning model trained on a large corpus (more than 100 million news articles from Chan-Yub Park).

Performance

Test Set                                 Accuracy
Sejong (colloquial style) Corpus (1M)    97.1%
OOOO (literary style) Corpus (3M)        94.3%
  • Accuracy = (number of correctly spaced characters) / (number of characters in the test data).
    • Accuracy might improve further if compound words are normalized.
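
The accuracy definition above can be sketched in a few lines of Python. This is a minimal illustration, not the evaluation script behind the reported numbers. The names `spacing_labels` and `spacing_accuracy` are hypothetical, and the sketch assumes the predicted and reference sentences differ only in where spaces occur.

```python
def spacing_labels(sentence):
    """Return one 0/1 label per non-space character: 1 if a space follows it."""
    chars = sentence.strip()
    labels = []
    for i, ch in enumerate(chars):
        if ch == " ":
            continue
        next_is_space = i + 1 < len(chars) and chars[i + 1] == " "
        labels.append(1 if next_is_space else 0)
    return labels


def spacing_accuracy(predicted, reference):
    """Fraction of characters whose spacing decision matches the reference."""
    pred = spacing_labels(predicted)
    gold = spacing_labels(reference)
    if len(pred) != len(gold):
        raise ValueError("sentences must contain the same non-space characters")
    if not gold:
        return 1.0
    correct = sum(p == g for p, g in zip(pred, gold))
    return correct / len(gold)


reference = "아버지가 방에 들어가신다."
predicted = "아버지 가방에 들어가신다."
print(spacing_accuracy(predicted, reference))
```

In this example, two of the twelve characters (지 and 가) receive a different spacing decision than in the reference, so the score is 10/12.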

Install

PyPI Install

Prerequisites:

  • A working installation of Python 3
  • A working installation of pip

pip install tensorflow
pip install keras


Windows-Ubuntu case: if the following error occurs, run the commands below.
Error: /usr/lib/x86_64-linux-gnu/libstdc++.so.6: version `GLIBCXX_3.4.22' not found
   sudo apt-get install libstdc++6
   sudo add-apt-repository ppa:ubuntu-toolchain-r/test
   sudo apt-get update
   sudo apt-get upgrade
   sudo apt-get dist-upgrade   # This takes a long time.

Darwin (M1) case: TensorFlow must be installed in a different way, using Miniforge3.

# Install Miniforge3 for mac
curl -L -O https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-MacOSX-arm64.sh
chmod +x Miniforge3-MacOSX-arm64.sh
sh Miniforge3-MacOSX-arm64.sh
# Activate Miniforge3 virtualenv
# Use Python 3.10 or earlier.
source ~/miniforge3/bin/activate
# Install the Tensorflow dependencies 
conda install -c apple tensorflow-deps 
# Install base tensorflow 
python -m pip install tensorflow-macos 
# Install metal plugin 
python -m pip install tensorflow-metal

To install from GitHub, use

pip install git+https://github.com/haven-jeon/PyKoSpacing.git
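
After installing, it can help to confirm that the required packages are importable before running the examples. The helper below is a hypothetical sketch that uses only the standard library. `missing_packages` is not part of PyKoSpacing, and the module names are assumed from the install steps above.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level module names that cannot be located."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Module names assumed from the install steps above.
required = ["tensorflow", "keras", "pykospacing"]
missing = missing_packages(required)
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("All required packages are importable.")
```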

Example

>>> from pykospacing import Spacing
>>> spacing = Spacing()
>>> spacing("김형호영화시장분석가는'1987'의네이버영화정보네티즌10점평에서언급된단어들을지난해12월27일부터올해1월10일까지통계프로그램R과KoNLP패키지로텍스트마이닝하여분석했다.")
"김형호 영화시장 분석가는 '1987'의 네이버 영화 정보 네티즌 10점 평에서 언급된 단어들을 지난해 12월 27일부터 올해 1월 10일까지 통계 프로그램 R과 KoNLP 패키지로 텍스트마이닝하여 분석했다."
>>> # Apply a list of words that must be non-spacing
>>> spacing('귀밑에서턱까지잇따라난수염을구레나룻이라고한다.')
'귀 밑에서 턱까지 잇따라 난 수염을 구레나 룻이라고 한다.'
>>> spacing = Spacing(rules=['구레나룻'])
>>> spacing('귀밑에서턱까지잇따라난수염을구레나룻이라고한다.')
'귀 밑에서 턱까지 잇따라 난 수염을 구레나룻이라고 한다.'
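
To make the effect of a rule concrete, here is a minimal sketch of how a non-spacing rule could be enforced as a post-processing step on text that has already been spaced. It is an illustration only and not necessarily how PyKoSpacing implements rules internally. `apply_rules` is a hypothetical helper.

```python
import re


def apply_rules(spaced_text, rules):
    """Remove any whitespace that was inserted inside each rule word."""
    for word in rules:
        if not word:
            continue  # ignore empty rule strings
        # Allow optional whitespace between consecutive characters of the word.
        pattern = re.compile(r"\s*".join(re.escape(ch) for ch in word))
        # A function replacement keeps the word literal (no backslash handling).
        spaced_text = pattern.sub(lambda match, w=word: w, spaced_text)
    return spaced_text


model_output = "귀 밑에서 턱까지 잇따라 난 수염을 구레나 룻이라고 한다."
print(apply_rules(model_output, ["구레나룻"]))
```

The pattern matches the rule word whether or not spaces were placed between its characters, so "구레나 룻" is joined back into "구레나룻" while the rest of the sentence is left untouched.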

Setting rules from a CSV file (you only need to call the set_rules_by_csv() method):

$ cat test.csv
인덱스,단어
1,네이버영화
2,언급된단어
>>> from pykospacing import Spacing
>>> spacing = Spacing(rules=[''])
>>> spacing.set_rules_by_csv('./test.csv', '단어')
>>> spacing("김형호영화시장분석가는'1987'의네이버영화정보네티즌10점평에서언급된단어들을지난해12월27일부터올해1월10일까지통계프로그램R과KoNLP패키지로텍스트마이닝하여분석했다.")
"김형호 영화시장 분석가는 '1987'의 네이버영화 정보 네티즌 10점 평에서 언급된단어들을 지난해 12월 27일부터 올해 1월 10일까지 통계 프로그램 R과 KoNLP 패키지로 텍스트마이닝하여 분석했다."

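If you want to inspect or filter the rule words before handing them to the spacer, the CSV file can also be read with the standard library and passed through the `rules` argument used in the earlier example. `load_rule_words` is a hypothetical helper, and the file layout is assumed to match the test.csv shown above (a header row with a 단어, i.e. "word", column).

```python
import csv
import os
import tempfile


def load_rule_words(csv_path, column):
    """Return the non-empty values found in `column` of the CSV file."""
    words = []
    # utf-8-sig reads plain UTF-8 too and strips a byte-order mark if present.
    with open(csv_path, newline="", encoding="utf-8-sig") as f:
        for row in csv.DictReader(f):
            value = (row.get(column) or "").strip()
            if value:
                words.append(value)
    return words


# Demo: recreate the test.csv from the example in a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "test.csv")
    with open(path, "w", encoding="utf-8") as f:
        f.write("인덱스,단어\n1,네이버영화\n2,언급된단어\n")
    words = load_rule_words(path, "단어")

print(words)
# With pykospacing installed, the list can be passed as Spacing(rules=words).
```
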
Run on the command line (thanks to lqez).

$ cat test_in.txt
김형호영화시장분석가는'1987'의네이버영화정보네티즌10점평에서언급된단어들을지난해12월27일부터올해1월10일까지통계프로그램R과KoNLP패키지로텍스트마이닝하여분석했다.
아버지가방에들어가신다.
$ python -m pykospacing.pykos test_in.txt
김형호 영화시장 분석가는 '1987'의 네이버 영화 정보 네티즌 10점 평에서 언급된 단어들을 지난해 12월 27일부터 올해 1월 10일까지 통계 프로그램 R과 KoNLP 패키지로 텍스트마이닝하여 분석했다.
아버지가 방에 들어가신다.
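
The same line-by-line processing can be sketched in Python for use inside a script. `space_lines` is a hypothetical helper that accepts any callable mapping a string to a string, so a `Spacing()` instance can be passed in. It is not the code behind `pykospacing.pykos`.

```python
def space_lines(lines, spacer):
    """Apply `spacer` to each non-blank line, preserving line order."""
    results = []
    for line in lines:
        text = line.rstrip("\n")
        # Blank lines are passed through unchanged.
        results.append(spacer(text) if text.strip() else text)
    return results


# Usage (requires pykospacing to be installed):
#   from pykospacing import Spacing
#   with open("test_in.txt", encoding="utf-8") as f:
#       for spaced in space_lines(f, Spacing()):
#           print(spaced)
```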

Model Architecture

For Training

Citation

@misc{heewon2018,
author = {Heewon Jeon},
title = {KoSpacing: Automatic Korean word spacing},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/haven-jeon/KoSpacing}}
}
