Korean Sentence Splitter
Splits Korean text into sentences using a heuristic algorithm. The algorithm was greatly inspired by EungGyun Kim, NLP Leader at Kakao and one of the most brilliant NLP engineers in Korea.
I started this project after being inspired by this article, and it achieved the best result on the test set. It is also robust to both spoken and written expressions.
NOTICE (translated from Korean, 21 Dec 2020):
Around last year, inspired by the Korean sentence splitter that had been in use at Kakao, I wrote a new Korean sentence splitter in C++ and released it as open source, and it was well received. However, I was barely able to respond to the steady stream of questions and requests, and, decisively, I was assigned to a completely different project and could no longer maintain it. Because it was implemented in C++, build questions in particular were very frequent. Ideally, binaries would be built and uploaded for each OS (Windows, Mac, Linux), but this is not a commercial project and I could not manage it to that extent.
Sentence splitting itself does not demand such high performance that it must be written in C++. Given the hassle of building and the cost of maintenance, I had been thinking it was time to move to another language. Just then, Hyunwoong Ko ported everything to Python, and seeing him steadily improve it, I realized it was time to hand the project over.
I released versions up to 1.3.1 early this year, and starting today 2.0.0 begins with the Python port written by Hyunwoong Ko. It is the final improved version, reflecting all the issues and PRs that had come to me and that I had been unable to handle, and I have high hopes that it will continue to be improved.
The package can be installed the same way as before, with
pip install kss
Please file issues about bugs or improvements at the new repository, https://github.com/hyunwoongko/kss. I ask for your continued support.
Thank you.
https://www.facebook.com/groups/TensorFlowKR/permalink/1383839988623722/
Installation
The package is listed in the Python Package Index (PyPI), so you can install it with pip:
$ pip install kss
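After installing, it can be useful to confirm which version pip actually resolved. The following is a minimal sketch that uses only the standard library's importlib.metadata (Python 3.8+); the helper name installed_version is hypothetical and not part of kss.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        # The distribution is not installed in the current environment.
        return None


if __name__ == "__main__":
    version = installed_version("kss")
    print("kss is not installed" if version is None else f"kss {version}")
```

Returning None instead of raising keeps the check easy to use in setup scripts that only need a yes/no answer.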
Usage
import kss

s = "회사 동료 분들과 다녀왔는데 분위기도 좋고 음식도 맛있었어요 다만, 강남 토끼정이 강남 쉑쉑버거 골목길로 쭉 올라가야 하는데 다들 쉑쉑버거의 유혹에 넘어갈 뻔 했답니다 강남역 맛집 토끼정의 외부 모습."
for sent in kss.split_sentences(s):
    print(sent)
The result is shown below:
회사 동료 분들과 다녀왔는데 분위기도 좋고 음식도 맛있었어요
다만, 강남 토끼정이 강남 쉑쉑버거 골목길로 쭉 올라가야 하는데 다들 쉑쉑버거의 유혹에 넘어갈 뻔 했답니다
강남역 맛집 토끼정의 외부 모습.
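To give a rough feel for what a heuristic, ending-based splitter does, here is a toy sketch. It is an illustration only and is NOT the algorithm used by kss: the function name toy_split_sentences and the tiny list of endings are assumptions made for this example, and real Korean text needs far more rules than this.

```python
# Toy illustration only: this is NOT the kss algorithm.
# A handful of hypothetical sentence-final endings, far from complete.
SENTENCE_ENDINGS = ("어요", "니다", ".", "?", "!")


def toy_split_sentences(text):
    """Split on whitespace and close a sentence after any token with a known ending."""
    sentences = []
    current = []
    for token in text.split():
        current.append(token)
        if token.endswith(SENTENCE_ENDINGS):
            sentences.append(" ".join(current))
            current = []
    if current:
        # Keep trailing text that did not end with a known ending.
        sentences.append(" ".join(current))
    return sentences


if __name__ == "__main__":
    s = "회사 동료 분들과 다녀왔는데 분위기도 좋고 음식도 맛있었어요 다만, 강남 토끼정이 강남 쉑쉑버거 골목길로 쭉 올라가야 하는데 다들 쉑쉑버거의 유혹에 넘어갈 뻔 했답니다 강남역 맛집 토끼정의 외부 모습."
    for sent in toy_split_sentences(s):
        print(sent)
```

Such a naive rule breaks quickly, for example on endings that also appear mid-sentence or inside quotations, which is why a dedicated splitter is preferable for real text.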
Requirements
Mac, Linux
- C++11
- GCC or Clang with C++11 support
- Python 3+
NOTICE: The provided Google Test binary was built on macOS.
Windows
- Microsoft C++ Build Tools
- Python 3+
- Cython
$ pip install cython
Build from scratch
C++
$ mkdir bld
$ cd bld
$ cmake ..
$ make
$ ./sentsplit
NOTICE: The provided Google Test binary was built on macOS only, so you cannot build the test binary on Linux.
#include <iostream>
#include <string>
#include "sentence_splitter.h"

int main() {
    // Input text containing several sentences with little punctuation.
    std::string s = "회사 동료 분들과 다녀왔는데 분위기도 좋고 음식도 맛있었어요 다만, 강남 토끼정이 강남 쉑쉑버거 골목길로 쭉 올라가야 하는데 다들 쉑쉑버거의 유혹에 넘어갈 뻔 했답니다 강남역 맛집 토끼정의 외부 모습.";
    // Iterate by const reference to avoid copying each sentence.
    for (const auto& sent : splitSentences(s)) {
        std::cout << sent << std::endl;
    }
    return 0;
}
The result is shown below:
회사 동료 분들과 다녀왔는데 분위기도 좋고 음식도 맛있었어요
다만, 강남 토끼정이 강남 쉑쉑버거 골목길로 쭉 올라가야 하는데 다들 쉑쉑버거의 유혹에 넘어갈 뻔 했답니다
강남역 맛집 토끼정의 외부 모습.
Python
The Python wrapper is implemented using Cython. You can build and install it with either of the commands below:
$ python setup.py install --record files.txt
or
$ pip install .
Uninstall
$ xargs rm -rf < files.txt
or
$ pip uninstall kss
PyPI
$ python setup.py sdist
$ twine upload --repository-url https://test.pypi.org/legacy/ dist/*