Korean Sentence Splitter

Split Korean text into sentences using a heuristic algorithm. The algorithm was greatly inspired by EungGyun Kim, NLP Leader at Kakao and one of the most brilliant NLP engineers in Korea.

I started this project after being inspired by this article, and it achieved the best result on the test set. It is also robust to both spoken and written expressions.
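
To give a rough feel for what a heuristic splitter does, here is a minimal Python sketch. It is not this project's algorithm: the `TOY_ENDINGS` list and the `toy_split` function are hypothetical names used only for illustration. The sketch closes a sentence after any word that ends in a common sentence-final syllable or punctuation mark, which is enough for simple inputs but will mis-split many real sentences (for example, nouns that happen to end in "다").

```python
from typing import List

# Hypothetical, deliberately tiny list of sentence-final endings.
# This is an assumption for illustration, not the project's rule set.
TOY_ENDINGS = ("요", "다", ".", "!", "?")


def toy_split(text: str) -> List[str]:
    """Close a sentence after any whitespace-delimited word with a toy ending."""
    sentences: List[str] = []
    current: List[str] = []
    for word in text.split():
        current.append(word)
        if word.endswith(TOY_ENDINGS):
            sentences.append(" ".join(current))
            current = []
    if current:
        # Keep trailing words that never reached a recognised ending.
        sentences.append(" ".join(current))
    return sentences
```

For real text, use the package itself as described under Usage.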

NOTICE (originally written in Korean, 21 Dec 2020):

Around last year, inspired by the Korean sentence splitter that had been in use at Kakao, I wrote a new Korean sentence splitter in C++ and released it as open source, where it was well received. However, I was barely able to respond to the ongoing questions and requests, and, decisively, I was assigned to a completely different project and could no longer maintain it. Because it was implemented in C++, there were a great many build questions in particular. The recommended approach is to build a binary for each OS (Windows, Mac, Linux) and upload those for distribution, but this is not a commercial project and I could not manage it to that extent.

Above all, sentence splitting does not demand such high performance that it has to be written in C++. Given the build hassle and the maintenance burden, I had been thinking it was time to switch to another language when Hyunwoong Ko ported everything to Python. Watching him steadily improve it, I realized the time had come to hand the project over.

I released up to version 1.3.1 early this year, and starting today version 2.0.0 begins with the Python port made by Hyunwoong Ko. It is the final improved version, reflecting all the issues and PRs that had come to me and that I had not been able to handle, and I have high hopes that he will keep improving it.

The package can be installed the same way as before, with pip install kss. Please file issues about bugs or improvements at the new repository, https://github.com/hyunwoongko/kss.

Thank you for your continued support.

https://www.facebook.com/groups/TensorFlowKR/permalink/1383839988623722/

Installation

The package is listed in the Python Package Index (PyPI), so you can install it with pip:

$ pip install kss

Usage

import kss

s = "회사 동료 분들과 다녀왔는데 분위기도 좋고 음식도 맛있었어요 다만, 강남 토끼정이 강남 쉑쉑버거 골목길로 쭉 올라가야 하는데 다들 쉑쉑버거의 유혹에 넘어갈 뻔 했답니다 강남역 맛집 토끼정의 외부 모습."
for sent in kss.split_sentences(s):
    print(sent)

The result is shown below:

회사 동료 분들과 다녀왔는데 분위기도 좋고 음식도 맛있었어요
다만, 강남 토끼정이 강남 쉑쉑버거 골목길로 쭉 올라가야 하는데 다들 쉑쉑버거의 유혹에 넘어갈 뻔 했답니다
강남역 맛집 토끼정의 외부 모습.

Requirements

Mac, Linux

  • C++11
    • GCC or Clang with C++11 support.
  • Python 3+

NOTICE: The provided Google Test binary was built on macOS.

Windows

  • Microsoft C++ Build Tools
  • Python 3+
  • Cython
$ pip install cython

Build from scratch

C++

$ mkdir bld
$ cd bld
$ cmake ..
$ make
$ ./sentsplit

NOTICE: The provided Google Test binary was built on macOS only, so you cannot build the test binary on Linux.

#include <iostream>
#include <string>
#include "sentence_splitter.h"

int main() {
    // Sample input: three sentences, only the last of which ends with punctuation.
    std::string s = "회사 동료 분들과 다녀왔는데 분위기도 좋고 음식도 맛있었어요 다만, 강남 토끼정이 강남 쉑쉑버거 골목길로 쭉 올라가야 하는데 다들 쉑쉑버거의 유혹에 넘어갈 뻔 했답니다 강남역 맛집 토끼정의 외부 모습.";
    // Print each sentence returned by splitSentences() on its own line.
    for (const auto& sent : splitSentences(s)) {
        std::cout << sent << std::endl;
    }

    return 0;
}

The result is shown below:

회사 동료 분들과 다녀왔는데 분위기도 좋고 음식도 맛있었어요
다만, 강남 토끼정이 강남 쉑쉑버거 골목길로 쭉 올라가야 하는데 다들 쉑쉑버거의 유혹에 넘어갈 뻔 했답니다
강남역 맛집 토끼정의 외부 모습.

Python

The Python wrapper is implemented using Cython. You can build and install it with either of the commands below:

$ python setup.py install --record files.txt
or
$ pip install .

Uninstall

$ xargs rm -rf < files.txt
or
$ pip uninstall kss
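
Because `rm -rf` deletes without asking, it can help to inspect the record file first. The following is a minimal sketch, assuming `files.txt` lists one installed path per line. The function names `recorded_paths` and `remove_recorded` are hypothetical. Unlike the `xargs` command, it only touches regular files and defaults to a dry run that reports what would be removed.

```python
from pathlib import Path
from typing import List


def recorded_paths(record_file: str) -> List[Path]:
    """Return the paths listed in the record file, one per non-empty line."""
    lines = Path(record_file).read_text(encoding="utf-8").splitlines()
    return [Path(line.strip()) for line in lines if line.strip()]


def remove_recorded(record_file: str, dry_run: bool = True) -> List[Path]:
    """Return recorded files that still exist; delete them unless dry_run is True."""
    existing = [path for path in recorded_paths(record_file) if path.is_file()]
    if not dry_run:
        for path in existing:
            path.unlink()
    return existing
```

Calling `remove_recorded("files.txt")` lists the files that would be deleted, and passing `dry_run=False` deletes them.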

PyPI

$ python setup.py sdist
$ twine upload --repository-url https://test.pypi.org/legacy/ dist/*