• Stars: 361
  • Rank: 113,759 (Top 3%)
  • Language: Python
  • License: BSD 3-Clause "New...
  • Created: over 3 years ago
  • Updated: 8 days ago



KSS: Korean String processing Suite

Kss: A Toolkit for Korean sentence segmentation


This repository contains the source code of Kss, a representative Korean sentence segmentation toolkit. I also conduct ongoing research on Korean sentence segmentation algorithms and report the results in this repository. If you have good ideas about Korean sentence segmentation, please feel free to discuss them via an issue.


What's New:

Installation

Install Kss

Kss can be easily installed using the pip package manager.

pip install kss

Install Mecab (Optional)

To make Kss much faster, please install mecab or konlpy.tag.Mecab.

Features

1) split_sentences: split text into sentences

from kss import split_sentences

split_sentences(
    text: Union[str, List[str], Tuple[str]],
    backend: str = "auto",
    num_workers: Union[int, str] = "auto",
    strip: bool = True,
    ignores: List[str] = None,
)
Parameters
  • text: String or List/Tuple of strings
    • string: single text segmentation
    • list/tuple of strings: batch texts segmentation
  • backend: Morpheme analyzer backend
    • backend='auto': find mecab → konlpy.tag.Mecab → pecab → punct and use the first analyzer found (default)
    • backend='mecab': find mecab → konlpy.tag.Mecab and use the first analyzer found
    • backend='pecab': use pecab analyzer
    • backend='punct': split sentences only near punctuation marks
  • num_workers: The number of multiprocessing workers
    • num_workers='auto': use multiprocessing with the maximum number of workers if possible (default)
    • num_workers=1: don't use multiprocessing
    • num_workers=2~N: use multiprocessing with the specified number of workers
  • strip: Whether to strip() all output sentences
    • strip=True: strip() all output sentences (default)
    • strip=False: do not strip() output sentences
  • ignores: strings to ignore (i.e., not to split at)
    • See detailed usage in the Usages section below
Usages
  • Single text segmentation

    import kss
    
    text = "ํšŒ์‚ฌ ๋™๋ฃŒ ๋ถ„๋“ค๊ณผ ๋‹ค๋…€์™”๋Š”๋ฐ ๋ถ„์œ„๊ธฐ๋„ ์ข‹๊ณ  ์Œ์‹๋„ ๋ง›์žˆ์—ˆ์–ด์š” ๋‹ค๋งŒ, ๊ฐ•๋‚จ ํ† ๋ผ์ •์ด ๊ฐ•๋‚จ ์‰‘์‰‘๋ฒ„๊ฑฐ ๊ณจ๋ชฉ๊ธธ๋กœ ์ญ‰ ์˜ฌ๋ผ๊ฐ€์•ผ ํ•˜๋Š”๋ฐ ๋‹ค๋“ค ์‰‘์‰‘๋ฒ„๊ฑฐ์˜ ์œ ํ˜น์— ๋„˜์–ด๊ฐˆ ๋ป” ํ–ˆ๋‹ต๋‹ˆ๋‹ค ๊ฐ•๋‚จ์—ญ ๋ง›์ง‘ ํ† ๋ผ์ •์˜ ์™ธ๋ถ€ ๋ชจ์Šต."
    
    kss.split_sentences(text)
    # ['ํšŒ์‚ฌ ๋™๋ฃŒ ๋ถ„๋“ค๊ณผ ๋‹ค๋…€์™”๋Š”๋ฐ ๋ถ„์œ„๊ธฐ๋„ ์ข‹๊ณ  ์Œ์‹๋„ ๋ง›์žˆ์—ˆ์–ด์š”', '๋‹ค๋งŒ, ๊ฐ•๋‚จ ํ† ๋ผ์ •์ด ๊ฐ•๋‚จ ์‰‘์‰‘๋ฒ„๊ฑฐ ๊ณจ๋ชฉ๊ธธ๋กœ ์ญ‰ ์˜ฌ๋ผ๊ฐ€์•ผ ํ•˜๋Š”๋ฐ ๋‹ค๋“ค ์‰‘์‰‘๋ฒ„๊ฑฐ์˜ ์œ ํ˜น์— ๋„˜์–ด๊ฐˆ ๋ป” ํ–ˆ๋‹ต๋‹ˆ๋‹ค', '๊ฐ•๋‚จ์—ญ ๋ง›์ง‘ ํ† ๋ผ์ •์˜ ์™ธ๋ถ€ ๋ชจ์Šต.']
  • Batch texts segmentation

    import kss
    
    texts = [
        "ํšŒ์‚ฌ ๋™๋ฃŒ ๋ถ„๋“ค๊ณผ ๋‹ค๋…€์™”๋Š”๋ฐ ๋ถ„์œ„๊ธฐ๋„ ์ข‹๊ณ  ์Œ์‹๋„ ๋ง›์žˆ์—ˆ์–ด์š” ๋‹ค๋งŒ, ๊ฐ•๋‚จ ํ† ๋ผ์ •์ด ๊ฐ•๋‚จ ์‰‘์‰‘๋ฒ„๊ฑฐ ๊ณจ๋ชฉ๊ธธ๋กœ ์ญ‰ ์˜ฌ๋ผ๊ฐ€์•ผ ํ•˜๋Š”๋ฐ ๋‹ค๋“ค ์‰‘์‰‘๋ฒ„๊ฑฐ์˜ ์œ ํ˜น์— ๋„˜์–ด๊ฐˆ ๋ป” ํ–ˆ๋‹ต๋‹ˆ๋‹ค",
        "๊ฐ•๋‚จ์—ญ ๋ง›์ง‘ ํ† ๋ผ์ •์˜ ์™ธ๋ถ€ ๋ชจ์Šต. ๊ฐ•๋‚จ ํ† ๋ผ์ •์€ 4์ธต ๊ฑด๋ฌผ ๋…์ฑ„๋กœ ์ด๋ฃจ์–ด์ ธ ์žˆ์Šต๋‹ˆ๋‹ค.",
        "์—ญ์‹œ ํ† ๋ผ์ • ๋ณธ ์  ๋‹ต์ฃ ?ใ…Žใ……ใ…Ž ๊ฑด๋ฌผ์€ ํฌ์ง€๋งŒ ๊ฐ„ํŒ์ด ์—†๊ธฐ ๋•Œ๋ฌธ์— ์ง€๋‚˜์น  ์ˆ˜ ์žˆ์œผ๋‹ˆ ์กฐ์‹ฌํ•˜์„ธ์š” ๊ฐ•๋‚จ ํ† ๋ผ์ •์˜ ๋‚ด๋ถ€ ์ธํ…Œ๋ฆฌ์–ด.",
    ]
    
    kss.split_sentences(texts)
    # [['ํšŒ์‚ฌ ๋™๋ฃŒ ๋ถ„๋“ค๊ณผ ๋‹ค๋…€์™”๋Š”๋ฐ ๋ถ„์œ„๊ธฐ๋„ ์ข‹๊ณ  ์Œ์‹๋„ ๋ง›์žˆ์—ˆ์–ด์š”', '๋‹ค๋งŒ, ๊ฐ•๋‚จ ํ† ๋ผ์ •์ด ๊ฐ•๋‚จ ์‰‘์‰‘๋ฒ„๊ฑฐ ๊ณจ๋ชฉ๊ธธ๋กœ ์ญ‰ ์˜ฌ๋ผ๊ฐ€์•ผ ํ•˜๋Š”๋ฐ ๋‹ค๋“ค ์‰‘์‰‘๋ฒ„๊ฑฐ์˜ ์œ ํ˜น์— ๋„˜์–ด๊ฐˆ ๋ป” ํ–ˆ๋‹ต๋‹ˆ๋‹ค']
    # ['๊ฐ•๋‚จ์—ญ ๋ง›์ง‘ ํ† ๋ผ์ •์˜ ์™ธ๋ถ€ ๋ชจ์Šต.', '๊ฐ•๋‚จ ํ† ๋ผ์ •์€ 4์ธต ๊ฑด๋ฌผ ๋…์ฑ„๋กœ ์ด๋ฃจ์–ด์ ธ ์žˆ์Šต๋‹ˆ๋‹ค.']
    # ['์—ญ์‹œ ํ† ๋ผ์ • ๋ณธ ์  ๋‹ต์ฃ ?ใ…Žใ……ใ…Ž', '๊ฑด๋ฌผ์€ ํฌ์ง€๋งŒ ๊ฐ„ํŒ์ด ์—†๊ธฐ ๋•Œ๋ฌธ์— ์ง€๋‚˜์น  ์ˆ˜ ์žˆ์œผ๋‹ˆ ์กฐ์‹ฌํ•˜์„ธ์š”', '๊ฐ•๋‚จ ํ† ๋ผ์ •์˜ ๋‚ด๋ถ€ ์ธํ…Œ๋ฆฌ์–ด.']]
  • Preserve all prefix/suffix whitespace characters for original text recoverability

    import kss
    
    text = "ํšŒ์‚ฌ ๋™๋ฃŒ ๋ถ„๋“ค๊ณผ ๋‹ค๋…€์™”๋Š”๋ฐ ๋ถ„์œ„๊ธฐ๋„ ์ข‹๊ณ  ์Œ์‹๋„ ๋ง›์žˆ์—ˆ์–ด์š”\n๋‹ค๋งŒ, ๊ฐ•๋‚จ ํ† ๋ผ์ •์ด ๊ฐ•๋‚จ ์‰‘์‰‘๋ฒ„๊ฑฐ ๊ณจ๋ชฉ๊ธธ๋กœ ์ญ‰ ์˜ฌ๋ผ๊ฐ€์•ผ ํ•˜๋Š”๋ฐ ๋‹ค๋“ค ์‰‘์‰‘๋ฒ„๊ฑฐ์˜ ์œ ํ˜น์— ๋„˜์–ด๊ฐˆ ๋ป” ํ–ˆ๋‹ต๋‹ˆ๋‹ค ๊ฐ•๋‚จ์—ญ ๋ง›์ง‘ ํ† ๋ผ์ •์˜ ์™ธ๋ถ€ ๋ชจ์Šต."
    
    kss.split_sentences(text, strip=False)
    # ['ํšŒ์‚ฌ ๋™๋ฃŒ ๋ถ„๋“ค๊ณผ ๋‹ค๋…€์™”๋Š”๋ฐ ๋ถ„์œ„๊ธฐ๋„ ์ข‹๊ณ  ์Œ์‹๋„ ๋ง›์žˆ์—ˆ์–ด์š”\n', '๋‹ค๋งŒ, ๊ฐ•๋‚จ ํ† ๋ผ์ •์ด ๊ฐ•๋‚จ ์‰‘์‰‘๋ฒ„๊ฑฐ ๊ณจ๋ชฉ๊ธธ๋กœ ์ญ‰ ์˜ฌ๋ผ๊ฐ€์•ผ ํ•˜๋Š”๋ฐ ๋‹ค๋“ค ์‰‘์‰‘๋ฒ„๊ฑฐ์˜ ์œ ํ˜น์— ๋„˜์–ด๊ฐˆ ๋ป” ํ–ˆ๋‹ต๋‹ˆ๋‹ค ', '๊ฐ•๋‚จ์—ญ ๋ง›์ง‘ ํ† ๋ผ์ •์˜ ์™ธ๋ถ€ ๋ชจ์Šต.']
  • Ignore strings from sentence splitting

    import kss
    
    text = """์ฒซ์งธ. ๋ฒ ํŠธ๋‚จ ์ง€์—ญ์—์„œ๋Š” ์ผ์ฐ๋ถ€ํ„ฐ ๋ฐ˜๋ž‘๊ตญ, ์–ด์šฐ๋ฝ ์™•๊ตญ, ๋‚จ๋น„์—ฃ(๋‚จ์›”) ๋“ฑ์ด ๊ฑด๊ตญ๋˜์–ด ๋ฐœ์ „ํ•˜์˜€๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ํ•œ ๋ฌด์ œ์˜ ๊ณต๊ฒฉ์œผ๋กœ ์ค‘๊ตญ์˜ ์ง€๋ฐฐ๋ฅผ ๋ฐ›๊ธฐ ์‹œ์ž‘ํ•˜๋ฉด์„œ ์ค‘๊ตญ ๋ฌธํ™”์˜ ์˜ํ–ฅ์„ ๋ฐ›๊ฒŒ ๋˜์—ˆ๋‹ค. ํŠนํžˆ ๋‹น์˜ ์ง€๋ฐฐ๋ฅผ ๋ฐ›์œผ๋ฉด์„œ ๋‹น ๋ฌธํ™”์˜ ์˜ํ–ฅ์„ ๋งŽ์ด ๋ฐ›์•˜๋‹ค.
    ๋‘˜์งธ. ๋ฒ ํŠธ๋‚จ์—์„œ๋„ ์ค‘๊ตญ ๋ฌธํ™”์˜ ์˜ํ–ฅ ์†์—์„œ ์œ ๊ต ๋ฌธํ™”๊ฐ€ ๋ฐœ๋‹ฌํ•˜์˜€๋‹ค. ํŠนํžˆ ๋ฒ ํŠธ๋‚จ์˜ ๋ฆฌ ์™•์กฐ ๋•Œ์—๋Š” ๋ฌธ๋ฌ˜๊ฐ€ ์„ค์น˜๋˜๊ณ , ๊ณผ๊ฑฐ์ œ๊ฐ€ ์‹œํ–‰๋˜๊ธฐ๋„ ํ•˜์˜€๋‹ค. ํ•œํŽธ ๋ ˆ(ํ›„๊ธฐ) ์™•์กฐ ๋•Œ์—๋Š” ์„ฑ๋ฆฌํ•™์„ ๋ฐ”ํƒ•์œผ๋กœ ํ•œ ์œ ๊ต ๋ฌธํ™”๊ฐ€ ํ™•์‚ฐ๋˜์—ˆ๋‹ค.
    ์…‹์งธ. ๋ฒ ํŠธ๋‚จ์—์„œ๋Š” ๊ฐ•์ˆ˜๋Ÿ‰์ด ํ’๋ถ€ํ•˜๊ณ , ๋‚ ์”จ๊ฐ€ ๋”ฐ๋œปํ•˜์—ฌ ๋ฒผ๋†์‚ฌ ์ค‘์‹ฌ์˜ ๋†๊ฒฝ ์ƒํ™œ์ด ์ด๋ฃจ์–ด์ง€๊ณ  ์žˆ๋‹ค.
    ๋„ท์งธ. ๋™์•„์‹œ์•„ ์ง€์—ญ์€ ๊ณ„์ ˆ์— ๋”ฐ๋ผ ๋ฐฉํ–ฅ์ด ๋ฐ”๋€Œ๋Š” ๊ณ„์ ˆํ’์˜ ์˜ํ–ฅ์„ ๊ฐ•ํ•˜๊ฒŒ ๋ฐ›๋Š” ๊ณณ์ด๋‹ค. ์„œ์•ˆ ํ•ด์–‘์„ฑ ๊ธฐํ›„๋Š” ์ค‘์œ„๋„์˜ ๋Œ€๋ฅ™ ์„œ์ชฝ ์ง€์—ญ์— ์ฃผ๋กœ ๋‚˜ํƒ€๋‚œ๋‹ค.
    """
    
    output = kss.split_sentences(text, ignores=["์ฒซ์งธ.", "๋‘˜์งธ.", "์…‹์งธ.", "๋„ท์งธ."])
    print(output)
    # ['์ฒซ์งธ. ๋ฒ ํŠธ๋‚จ ์ง€์—ญ์—์„œ๋Š” ์ผ์ฐ๋ถ€ํ„ฐ ๋ฐ˜๋ž‘๊ตญ, ์–ด์šฐ๋ฝ ์™•๊ตญ, ๋‚จ๋น„์—ฃ(๋‚จ์›”) ๋“ฑ์ด ๊ฑด๊ตญ๋˜์–ด ๋ฐœ์ „ํ•˜์˜€๋‹ค.', '๊ทธ๋Ÿฌ๋‚˜ ํ•œ ๋ฌด์ œ์˜ ๊ณต๊ฒฉ์œผ๋กœ ์ค‘๊ตญ์˜ ์ง€๋ฐฐ๋ฅผ ๋ฐ›๊ธฐ ์‹œ์ž‘ํ•˜๋ฉด์„œ ์ค‘๊ตญ ๋ฌธํ™”์˜ ์˜ํ–ฅ์„ ๋ฐ›๊ฒŒ ๋˜์—ˆ๋‹ค.', 'ํŠนํžˆ ๋‹น์˜ ์ง€๋ฐฐ๋ฅผ ๋ฐ›์œผ๋ฉด์„œ ๋‹น ๋ฌธํ™”์˜ ์˜ํ–ฅ์„ ๋งŽ์ด ๋ฐ›์•˜๋‹ค.', '๋‘˜์งธ. ๋ฒ ํŠธ๋‚จ์—์„œ๋„ ์ค‘๊ตญ ๋ฌธํ™”์˜ ์˜ํ–ฅ ์†์—์„œ ์œ ๊ต ๋ฌธํ™”๊ฐ€ ๋ฐœ๋‹ฌํ•˜์˜€๋‹ค.', 'ํŠนํžˆ ๋ฒ ํŠธ๋‚จ์˜ ๋ฆฌ ์™•์กฐ ๋•Œ์—๋Š” ๋ฌธ๋ฌ˜๊ฐ€ ์„ค์น˜๋˜๊ณ , ๊ณผ๊ฑฐ์ œ๊ฐ€ ์‹œํ–‰๋˜๊ธฐ๋„ ํ•˜์˜€๋‹ค.', 'ํ•œํŽธ ๋ ˆ(ํ›„๊ธฐ) ์™•์กฐ ๋•Œ์—๋Š” ์„ฑ๋ฆฌํ•™์„ ๋ฐ”ํƒ•์œผ๋กœ ํ•œ ์œ ๊ต ๋ฌธํ™”๊ฐ€ ํ™•์‚ฐ๋˜์—ˆ๋‹ค.', '์…‹์งธ. ๋ฒ ํŠธ๋‚จ์—์„œ๋Š” ๊ฐ•์ˆ˜๋Ÿ‰์ด ํ’๋ถ€ํ•˜๊ณ , ๋‚ ์”จ๊ฐ€ ๋”ฐ๋œปํ•˜์—ฌ ๋ฒผ๋†์‚ฌ ์ค‘์‹ฌ์˜ ๋†๊ฒฝ ์ƒํ™œ์ด ์ด๋ฃจ์–ด์ง€๊ณ  ์žˆ๋‹ค.', '๋„ท์งธ. ๋™์•„์‹œ์•„ ์ง€์—ญ์€ ๊ณ„์ ˆ์— ๋”ฐ๋ผ ๋ฐฉํ–ฅ์ด ๋ฐ”๋€Œ๋Š” ๊ณ„์ ˆํ’์˜ ์˜ํ–ฅ์„ ๊ฐ•ํ•˜๊ฒŒ ๋ฐ›๋Š” ๊ณณ์ด๋‹ค.', '์„œ์•ˆ ํ•ด์–‘์„ฑ ๊ธฐํ›„๋Š” ์ค‘์œ„๋„์˜ ๋Œ€๋ฅ™ ์„œ์ชฝ ์ง€์—ญ์— ์ฃผ๋กœ ๋‚˜ํƒ€๋‚œ๋‹ค.']    
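The ignores option above can be approximated by placeholder substitution. The following is only an illustrative sketch of the idea (split_with_ignores and split_fn are hypothetical names, not Kss's actual implementation):

```python
import re

def split_with_ignores(text, split_fn, ignores):
    # Replace each ignored string with a placeholder that contains no
    # split-triggering characters, split, then restore the originals.
    placeholders = {f"\x00{i}\x00": s for i, s in enumerate(ignores)}
    for ph, s in placeholders.items():
        text = text.replace(s, ph)
    sentences = split_fn(text)
    restored = []
    for sent in sentences:
        for ph, s in placeholders.items():
            sent = sent.replace(ph, s)
        restored.append(sent)
    return restored

# Any splitter works here; a naive punctuation-based one for the demo.
split_fn = lambda t: re.split(r"(?<=[.!?])\s", t)
print(split_with_ignores("No. 1 is good. It works.", split_fn, ["No. 1"]))
# → ['No. 1 is good.', 'It works.']
```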
Performance Analysis

1) Test Commands

You can reproduce all of the following results using the source code and datasets in the ./bench/ directory; the evaluation source code was copied from here. Note that the Baseline is a regex-based segmentation method (re.split(r"(?<=[.!?])\s", text)).

Name Command (in root directory)
Baseline python3 ./bench/test_baseline.py ./bench/testset/*.txt
Kiwi python3 ./bench/test_kiwi.py ./bench/testset/*.txt
Koalanlp python3 ./bench/test_koalanlp.py ./bench/testset/*.txt --backend=OKT/HNN/KMR/RHINO/EUNJEON/ARIRANG/KKMA
Kss (ours) python3 ./bench/test_kss.py ./bench/testset/*.txt --backend=mecab/pecab
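The Baseline command above boils down to a one-liner from the standard library; as a sketch:

```python
import re

def baseline_split(text: str) -> list:
    # Split after sentence-final punctuation (., !, ?) followed by
    # whitespace, exactly as the Baseline regex does.
    return re.split(r"(?<=[.!?])\s", text)

print(baseline_split("Hello world. How are you? Fine!"))
# → ['Hello world.', 'How are you?', 'Fine!']
```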

2) Evaluation datasets:

I used the following 7 evaluation datasets for the experiments below. Thanks to Minchul Lee for creating various sentence segmentation datasets.

Name Description Number of sentences Creator
blogs_lee Dataset for testing blog-style text segmentation 170 Minchul Lee
blogs_ko Dataset for testing blog-style text segmentation, harder than Lee's blogs dataset 346 Hyunwoong Ko
sample An example used in README.md (๊ฐ•๋‚จ ํ† ๋ผ์ •) 41 Isaac, modified by Hyunwoong Ko
tweets Dataset for testing Twitter-style text segmentation 178 Minchul Lee
wikipedia Dataset for testing Wikipedia-style text segmentation 326 Hyunwoong Ko
nested Dataset for testing segmentation of text containing parentheses and quotation marks 91 Minchul Lee
v_ending Dataset for testing difficult eomi segmentation; contains various dialect sentences 30 Minchul Lee

Note that I modified the labels of two sentences in sample.txt made by Isaac, because the original blog post was written like the following:

But Isaac's labels were:

In fact, ์‚ฌ์‹ค ์ „ ๊ณ ๊ธฐ๋ฅผ ์•ˆ ๋จน์–ด์„œ ๋ฌด์Šจ ๋ง›์ธ์ง€ ๋ชจ๋ฅด๊ฒ ์ง€๋งŒ.. and (๋ฌผ๋ก  ์ „ ์•ˆ ๋จน์—ˆ์ง€๋งŒ are embraced sentences (์•ˆ๊ธด๋ฌธ์žฅ), not independent sentences, so sentence segmentation tools should not split those parts.


3) Sentence segmentation performance (Quantitative Analysis)

The following tables show the segmentation performance based on Exact Match (EM), F1 score (F1) and Normalized F1 score (NF1).

  • EM score: This gives credit only when the output predictions exactly match the gold labels. It can be useful, but it is too harsh.
Name Library version Backend blogs_lee (EM) blogs_ko (EM) sample (EM) tweets (EM) wikipedia (EM) nested (EM) v_ending (EM) Average (EM)
Baseline N/A N/A 0.53529 0.43642 0.34146 0.51124 0.66258 0.68132 0.00000 0.45261
Koalanlp 2.1.7 OKT 0.53529 0.43642 0.36585 0.53371 0.65951 0.79121 0.00000 0.47457
Koalanlp 2.1.7 HNN 0.54118 0.44220 0.34146 0.54494 0.67791 0.78022 0.00000 0.47541
Koalanlp 2.1.7 KMR 0.51176 0.38439 0.26829 0.42135 0.45706 0.79121 0.00000 0.40486
Koalanlp 2.1.7 RHINO 0.52941 0.41329 0.29268 0.39326 0.67791 0.79121 0.00000 0.44253
Koalanlp 2.1.7 EUNJEON 0.51176 0.38728 0.21951 0.38202 0.59816 0.70330 0.00000 0.40029
Koalanlp 2.1.7 ARIRANG 0.51176 0.41618 0.29268 0.44382 0.66564 0.79121 0.00000 0.44589
Koalanlp 2.1.7 KKMA 0.52941 0.45954 0.31707 0.38202 0.57669 0.58242 0.06667 0.41626
Kiwi 0.14.1 N/A 0.78235 0.61272 0.90244 0.66292 0.63804 0.83516 0.20000 0.66194
Kss (ours) 4.2.0 pecab 0.87059 0.82659 0.95122 0.74157 0.98160 0.86813 0.36667 0.80091
Kss (ours) 4.2.0 mecab 0.87059 0.82659 0.95122 0.75281 1.00000 0.86813 0.36667 0.80514

  • F1 score (dice similarity): This measures the overlap between the output predictions and the gold labels, so it gives partial credit even when the predictions do not exactly match the gold labels. However, it is less reliable because it gives a huge advantage to splitters which separate sentences too finely.
Name Library version Backend blogs_lee (F1) blogs_ko (F1) sample (F1) tweets (F1) wikipedia (F1) nested (F1) v_ending (F1) Average (F1)
Baseline N/A N/A 0.66847 0.55724 0.54732 0.65446 0.76664 0.85438 0.11359 0.59458
Koalanlp 2.1.7 OKT 0.66847 0.55724 0.58642 0.69434 0.76639 0.93010 0.11359 0.61665
Koalanlp 2.1.7 HNN 0.69341 0.59185 0.57092 0.70350 0.98116 0.94163 0.11359 0.65658
Koalanlp 2.1.7 KMR 0.63506 0.48661 0.49026 0.56364 0.54806 0.85426 0.11359 0.52735
Koalanlp 2.1.7 RHINO 0.68313 0.53548 0.52258 0.57900 0.96743 0.85426 0.11359 0.60792
Koalanlp 2.1.7 EUNJEON 0.67063 0.54010 0.48446 0.65018 0.91846 0.80233 0.11359 0.59710
Koalanlp 2.1.7 ARIRANG 0.69407 0.57230 0.56872 0.67882 0.97884 0.85426 0.11359 0.63722
Koalanlp 2.1.7 KKMA 0.78127 0.66599 0.78335 0.56832 0.92527 0.89952 0.30797 0.70457
Kiwi 0.14.1 N/A 0.91323 0.76214 0.96003 0.84503 0.97740 0.98447 0.38535 0.83252
Kss (ours) 4.2.0 pecab 0.92162 0.90335 0.96826 0.82720 0.98801 0.93012 0.48153 0.86001
Kss (ours) 4.2.0 mecab 0.92162 0.90335 0.96826 0.83329 1.00000 0.93012 0.48153 0.86259

  • Normalized F1 score: This is the most reliable metric, introduced by the Kss project. It keeps the advantages of the F1 score while penalizing splitters which separate sentences too finely.
Name Library version Backend blogs_lee (NF1) blogs_ko (NF1) sample (NF1) tweets (NF1) wikipedia (NF1) nested (NF1) v_ending (NF1) Average (NF1)
Baseline N/A N/A 0.59884 0.52607 0.54732 0.61806 0.76379 0.75991 0.11359 0.56108
Koalanlp 2.1.7 OKT 0.62168 0.55724 0.58642 0.66198 0.76354 0.83832 0.11359 0.59182
Koalanlp 2.1.7 HNN 0.62515 0.57098 0.57092 0.66922 0.97286 0.82031 0.11359 0.62043
Koalanlp 2.1.7 KMR 0.61636 0.48412 0.49026 0.55535 0.54806 0.85426 0.11359 0.52314
Koalanlp 2.1.7 RHINO 0.63619 0.51835 0.52258 0.55140 0.95886 0.85426 0.11359 0.59360
Koalanlp 2.1.7 EUNJEON 0.62104 0.52132 0.48446 0.57766 0.91307 0.80233 0.11359 0.57261
Koalanlp 2.1.7 ARIRANG 0.58979 0.51149 0.56872 0.53500 0.94617 0.85426 0.11359 0.58843
Koalanlp 2.1.7 KKMA 0.73972 0.64048 0.78335 0.56408 0.89218 0.75068 0.30797 0.66835
Kiwi 0.14.1 N/A 0.84378 0.72367 0.93717 0.79056 0.91031 0.92687 0.34179 0.78202
Kss (ours) 4.2.0 pecab 0.88878 0.88605 0.96826 0.80771 0.98160 0.92063 0.48153 0.84957
Kss (ours) 4.2.0 mecab 0.88878 0.88605 0.96826 0.81379 1.00000 0.92063 0.48153 0.85129

Kss performed best on most metrics and datasets, and Kiwi also performed well. Both the baseline and Koalanlp performed poorly.


4) Consideration of metrics and Normalized F1 score

The evaluation source code, which was copied from kiwipiepy, provides both the EM score and the F1 score (dice similarity). But I don't believe either is a good metric for measuring sentence segmentation performance. In this section, I will show the problems of both the EM score and the F1 score, and propose a new metric, the Normalized F1 score, to solve them. For these experiments, I used Kiwi (0.14.1) and Word Split, where Word Split is equivalent to text.split(" ").

4.1) Problem of EM score

First, the EM score has the following problem. Consider this example:

  • Input text:

    ๋ธํฌ์ด ์„ฌ์— ์žˆ๋Š” ์•„ํด๋ก  ์‹ ์ „์€ ์•ž์ผ์„ ์˜ˆ์–ธํ•˜๋Š” ์‹ ํƒ์œผ๋กœ ์œ ๋ช…ํ•˜๋‹ค.[3] ์•„ํด๋ก ์ด ์•„์ง ํƒœ์–ด๋‚˜๊ธฐ ์ด์ „์— ๋ ˆํ† ๋Š”, ์ž์‹ ์ด ์ž„์‹ ํ•œ ์Œ๋‘ฅ์ด๋“ค์ด, ์•„๋ฒ„์ง€์ธ ์ œ์šฐ์Šค ๋‹ค์Œ๊ฐ€๋Š” ๊ถŒ๋ ฅ์„ ๋ˆ„๋ฆฌ๊ฒŒ ๋  ๊ฒƒ์ด๋ผ๋Š” ์˜ˆ์–ธ์„ ๋ฐ›์•˜๋‹ค๊ณ  ํ•œ๋‹ค. 
    
  • Label:

    ๋ธํฌ์ด ์„ฌ์— ์žˆ๋Š” ์•„ํด๋ก  ์‹ ์ „์€ ์•ž์ผ์„ ์˜ˆ์–ธํ•˜๋Š” ์‹ ํƒ์œผ๋กœ ์œ ๋ช…ํ•˜๋‹ค.[3] 
    ์•„ํด๋ก ์ด ์•„์ง ํƒœ์–ด๋‚˜๊ธฐ ์ด์ „์— ๋ ˆํ† ๋Š”, ์ž์‹ ์ด ์ž„์‹ ํ•œ ์Œ๋‘ฅ์ด๋“ค์ด, ์•„๋ฒ„์ง€์ธ ์ œ์šฐ์Šค ๋‹ค์Œ๊ฐ€๋Š” ๊ถŒ๋ ฅ์„ ๋ˆ„๋ฆฌ๊ฒŒ ๋  ๊ฒƒ์ด๋ผ๋Š” ์˜ˆ์–ธ์„ ๋ฐ›์•˜๋‹ค๊ณ  ํ•œ๋‹ค. 
    

And the two splitters split input text like the following:

  • Output of Kiwi (0.14.1):

    # EM score: 0.0
    
    ๋ธํฌ์ด ์„ฌ์— ์žˆ๋Š” ์•„ํด๋ก  ์‹ ์ „์€ ์•ž์ผ์„ ์˜ˆ์–ธํ•˜๋Š” ์‹ ํƒ์œผ๋กœ ์œ ๋ช…ํ•˜๋‹ค.
    [3] ์•„ํด๋ก ์ด ์•„์ง ํƒœ์–ด๋‚˜๊ธฐ ์ด์ „์— ๋ ˆํ† ๋Š”, ์ž์‹ ์ด ์ž„์‹ ํ•œ ์Œ๋‘ฅ์ด๋“ค์ด, ์•„๋ฒ„์ง€์ธ ์ œ์šฐ์Šค ๋‹ค์Œ๊ฐ€๋Š” ๊ถŒ๋ ฅ์„ ๋ˆ„๋ฆฌ๊ฒŒ ๋  ๊ฒƒ์ด๋ผ๋Š” ์˜ˆ์–ธ์„ ๋ฐ›์•˜๋‹ค๊ณ  ํ•œ๋‹ค. 
    
  • Output of Word Split:

    # EM score: 0.0
    
    ๋ธํฌ์ด
    ์„ฌ์—
    ์žˆ๋Š”
    ์•„ํด๋ก 
    ์‹ ์ „์€
    ์•ž์ผ์„
    ์˜ˆ์–ธํ•˜๋Š”
    ์‹ ํƒ์œผ๋กœ
    ์œ ๋ช…ํ•˜๋‹ค.[3]
    ์•„ํด๋ก ์ด
    ์•„์ง
    ํƒœ์–ด๋‚˜๊ธฐ
    ์ด์ „์—
    ๋ ˆํ† ๋Š”,
    ์ž์‹ ์ด
    ์ž„์‹ ํ•œ
    ์Œ๋‘ฅ์ด๋“ค์ด,
    ์•„๋ฒ„์ง€์ธ
    ์ œ์šฐ์Šค
    ๋‹ค์Œ๊ฐ€๋Š”
    ๊ถŒ๋ ฅ์„
    ๋ˆ„๋ฆฌ๊ฒŒ
    ๋ 
    ๊ฒƒ์ด๋ผ๋Š”
    ์˜ˆ์–ธ์„
    ๋ฐ›์•˜๋‹ค๊ณ 
    ํ•œ๋‹ค. 
    

Kiwi separated the sentences well except for the footnote ([3]). Even though its split was not exactly accurate, it was reasonably good. On the contrary, Word Split separated the sentences completely wrong. However, since neither output is identical to the label, both are rated 0.0 by the EM score. This is too harsh an evaluation for Kiwi. As such, the EM score does not properly evaluate performance when the segmentation is not exactly accurate.

You can reproduce this result using the following commands:

  • Kiwi: python3 ./bench/test_kiwi.py ./bench/metrics/em_problem.txt
  • Word Split: python3 ./bench/test_word_split.py ./bench/metrics/em_problem.txt
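For illustration, an all-or-nothing per-document EM score like the one described above can be sketched as follows (my own minimal version, not the bench code):

```python
def em_score(preds, golds):
    # Exact Match: full credit only when the predicted sentence list
    # is identical to the gold sentence list, otherwise zero.
    return 1.0 if preds == golds else 0.0

gold = ["A b c.", "D e f."]
print(em_score(["A b c.", "D e f."], gold))  # → 1.0
print(em_score(["A b c. D e f."], gold))     # → 0.0  (under-split: no credit)
```

Both Kiwi and Word Split fall into the second case above, which is why they receive the same 0.0 despite very different output quality.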

4.2) Problem of F1 score

We can use the F1 score to avoid the EM score's problem, but the F1 score has a problem of its own. Let's look at an example:

  • Input text:

    ๊ธฐ์–ตํ•ด ๋„Œ ๊ทธ ์• ์˜ ์นœ๊ตฌ์•ผ. ๋„ค๊ฐ€ ์ฃฝ์œผ๋ฉด ๋งˆ ๋“ค๋ ˆ ๋Š๊ฐ€ ํŽ‘ํŽ‘ ์šธ ๊ฑฐ์•ผ ๋น„ ์ฒด๋Š” ์Šฌํผํ•˜๊ฒ ์ง€ ์ด ์•ˆ์€ ํ™”๋ฅผ ๋‚ผ ๊ฑฐ์•ผ. ๋ฉ”์ด ์‹œ๋Š” ์–ด์ฉŒ๋ฉด ์กฐ๊ธˆ์€ ์ƒ๊ฐ ํ•ด ์ฃผ์ง€ ์•Š์„๊นŒ ์ค‘์š”ํ•œ ๊ฑด ๊ทธ๊ฑด ๋„ค๊ฐ€ ์ง€ํ‚ค๊ณ  ์‹ถ์–ด ํ–ˆ๋˜ ์‚ฌ๋žŒ๋“ค์ด์ž–์•„ ์–ด์„œ ๊ฐ€.
    
  • Label:

    ๊ธฐ์–ตํ•ด 
    ๋„Œ ๊ทธ ์• ์˜ ์นœ๊ตฌ์•ผ.
    ๋„ค๊ฐ€ ์ฃฝ์œผ๋ฉด ๋งˆ ๋“ค๋ ˆ ๋Š๊ฐ€ ํŽ‘ํŽ‘ ์šธ ๊ฑฐ์•ผ
    ๋น„ ์ฒด๋Š” ์Šฌํผํ•˜๊ฒ ์ง€
    ์ด ์•ˆ์€ ํ™”๋ฅผ ๋‚ผ ๊ฑฐ์•ผ.
    ๋ฉ”์ด ์‹œ๋Š” ์–ด์ฉŒ๋ฉด ์กฐ๊ธˆ์€ ์ƒ๊ฐ ํ•ด ์ฃผ์ง€ ์•Š์„๊นŒ
    ์ค‘์š”ํ•œ ๊ฑด ๊ทธ๊ฑด ๋„ค๊ฐ€ ์ง€ํ‚ค๊ณ  ์‹ถ์–ด ํ–ˆ๋˜ ์‚ฌ๋žŒ๋“ค์ด์ž–์•„
    ์–ด์„œ ๊ฐ€.
    

And the two splitters split this input like the following:

  • Output of Kiwi (0.14.1):

    F1 score: 0.56229
    
    Output:
    ๊ธฐ์–ตํ•ด ๋„Œ ๊ทธ ์• ์˜ ์นœ๊ตฌ์•ผ.
    ๋„ค๊ฐ€ ์ฃฝ์œผ๋ฉด ๋งˆ ๋“ค๋ ˆ ๋Š๊ฐ€ ํŽ‘ํŽ‘ ์šธ ๊ฑฐ์•ผ
    ๋น„ ์ฒด๋Š” ์Šฌํผํ•˜๊ฒ ์ง€
    ์ด ์•ˆ์€ ํ™”๋ฅผ ๋‚ผ ๊ฑฐ์•ผ.
    ๋ฉ”์ด ์‹œ๋Š” ์–ด์ฉŒ๋ฉด ์กฐ๊ธˆ์€ ์ƒ๊ฐ ํ•ด ์ฃผ์ง€ ์•Š์„๊นŒ ์ค‘์š”ํ•œ ๊ฑด ๊ทธ๊ฑด ๋„ค๊ฐ€ ์ง€ํ‚ค๊ณ  ์‹ถ์–ด ํ–ˆ๋˜ ์‚ฌ๋žŒ๋“ค์ด์ž–์•„ ์–ด์„œ ๊ฐ€.
    
  • Output of Word Split:

    F1 score: 0.58326
    
    Output:
    ๊ธฐ์–ตํ•ด
    ๋„Œ
    ๊ทธ
    ์• ์˜
    ์นœ๊ตฌ์•ผ.
    ๋„ค๊ฐ€
    ์ฃฝ์œผ๋ฉด
    ๋งˆ
    ๋“ค๋ ˆ
    ๋Š๊ฐ€
    ํŽ‘ํŽ‘
    ์šธ
    ๊ฑฐ์•ผ
    ๋น„
    ์ฒด๋Š”
    ์Šฌํผํ•˜๊ฒ ์ง€
    ์ด
    ์•ˆ์€
    ํ™”๋ฅผ
    ๋‚ผ
    ๊ฑฐ์•ผ.
    ๋ฉ”์ด
    ์‹œ๋Š”
    ์–ด์ฉŒ๋ฉด
    ์กฐ๊ธˆ์€
    ์ƒ๊ฐ
    ํ•ด
    ์ฃผ์ง€
    ์•Š์„๊นŒ
    ์ค‘์š”ํ•œ
    ๊ฑด
    ๊ทธ๊ฑด
    ๋„ค๊ฐ€
    ์ง€ํ‚ค๊ณ 
    ์‹ถ์–ด
    ํ–ˆ๋˜
    ์‚ฌ๋žŒ๋“ค์ด์ž–์•„
    ์–ด์„œ
    ๊ฐ€.
    

Neither splitter segmented the text perfectly, but Kiwi split it fairly well, while Word Split was completely wrong. Interestingly, Word Split's F1 score is 0.58326, higher than Kiwi's 0.56229. This shows that the F1 score (dice similarity) gives a huge advantage to splitters which separate sentences too finely.

You can reproduce this result using the following commands:

  • Kiwi: python3 ./bench/test_kiwi.py ./bench/metrics/f1_problem.txt
  • Word Split: python3 ./bench/test_word_split.py ./bench/metrics/f1_problem.txt

4.3) Normalized F1 score

To overcome the problems of both EM score and F1 score, I propose a new metric named Normalized F1 score. This can be obtained by the following formula.

Normalized_F1_score = F1_score * min(1, len(golds)/len(preds))

This inherits the advantages of the F1 score, but penalizes splitters which separate sentences too finely. If we re-evaluate the above two cases with the Normalized F1 score, the scores change as follows.
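As a sketch, the formula translates directly into code (normalized_f1 is an illustrative helper, not the bench implementation); the counts come from the Word Split example above (8 gold sentences vs. 39 predictions):

```python
def normalized_f1(f1, num_golds, num_preds):
    # Penalize splitters that emit more segments than the gold label has:
    # F1 is scaled down by golds/preds whenever preds > golds.
    return f1 * min(1.0, num_golds / num_preds)

# Kiwi: 5 predictions vs. 8 gold sentences -> no penalty, F1 unchanged.
print(normalized_f1(0.56229, 8, 5))   # → 0.56229
# Word Split: 39 predictions vs. 8 gold sentences -> heavy penalty.
print(round(normalized_f1(0.58326, 8, 39), 5))  # → 0.11964
```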

Splitter Library version Input sentences EM score Normalized F1 score
Kiwi 0.14.1 ๋ธํฌ์ด ์„ฌ์— ์žˆ๋Š” ์•„ํด๋ก ... 0.0 0.96341
Word Split N/A ๋ธํฌ์ด ์„ฌ์— ์žˆ๋Š” ์•„ํด๋ก ... 0.0 0.02145

Splitter Library version Input sentences F1 score Normalized F1 score
Kiwi 0.14.1 ๊ธฐ์–ตํ•ด ๋„Œ ๊ทธ ์• ์˜ ์นœ๊ตฌ... 0.56229 0.56229
Word Split N/A ๊ธฐ์–ตํ•ด ๋„Œ ๊ทธ ์• ์˜ ์นœ๊ตฌ... 0.58326 0.11964

In both cases, Word Split scores significantly lower than Kiwi. This shows that the Normalized F1 score complements the EM and F1 scores, which is why I introduce this new metric, Normalized F1, for sentence segmentation evaluation.


5) Where does the difference in performance come from? (Qualitative Analysis)

So far, I've conducted quantitative analysis and considered evaluation metrics. However, it is not very meaningful to compare the libraries by numbers alone; I definitely want you to see the actual segmentation results. Let's take blogs_ko samples as examples and compare the performance of each library. For this, I will use the best backend of each library on the blogs_ko dataset (Kss=mecab, Koalanlp=KKMA), because showing the results of all backends would be tiresome.

Example 1

  • Input text
๊ฑฐ์ œ ๋‚ด๋ ค๊ฐ€๋Š” ๊ธธ์— ํœด๊ฒŒ์†Œ๋ฅผ ๋“ค๋ ธ๋Š”๋ฐ ์ƒˆ๋กœ ์ƒ๊ฒผ๋‚˜๋ณด๋”๋ผ๊ตฌ์š”!? ๋‚จํŽธ๊ณผ ์ €, ๋‘˜ ๋‹ค ๋นต๋Ÿฌ๋ฒ„๋ผ ์ง€๋‚˜์น  ์ˆ˜ ์—†์–ด ๊ตฌ๋งคํ•ด ๋จน์–ด๋ดค๋‹ต๋‹ˆ๋‹น๐Ÿ˜Š ๋ณด์„ฑ๋…น์ฐจํœด๊ฒŒ์†Œ ์•ˆ์œผ๋กœ ๋“ค์–ด์˜ค์‹œ๋ฉด ๋”ฑ ๊ฐ€์šด๋ฐ ์œ„์น˜ํ•ด ์žˆ์–ด์š”ใ…Žใ…Ž ๊ทธ๋ž˜์„œ ์–ด๋Š ๋ฌธ์œผ๋กœ๋ผ๋„ ๋“ค์–ด์˜ค์…”๋„ ๊ฐ€๊น๋‹ต๋‹ˆ๋‹ค๐Ÿ˜‰ ๋ฉ”๋‰ดํŒ์„ ์ด๋ ‡๊ณ , ๊ฐ€๊ฒฉ์€ 2000์›~3000์› ์‚ฌ์ด์— ํ˜•์„ฑ ๋˜์–ด ์žˆ์–ด์š”! ์ด๋Ÿฐ๊ฑฐ ํ•˜๋‚˜ํ•˜๋‚˜ ๋ง›๋ณด๋Š”๊ฑฐ ๋„ˆ๋ฌด ์ข‹์•„ํ•˜๋Š”๋ฐ... ์ง„์ •ํ•˜๊ณ  ์†Œ๋ฏธ๋ฏธ ๋‹จํŒฅ๋นต ํ•˜๋‚˜, ์˜ฅ์ˆ˜์ˆ˜ ์น˜์ฆˆ๋นต ํ•˜๋‚˜, ๊ตฌ๋ฆฌ๋ณผ ํ•˜๋‚˜ ๊ณจ๋ž์Šต๋‹ˆ๋‹ค! ๋‹ค์Œ์— ๊ฐ€๋ฉด ๊ฐ•๋‚ญ์ฝฉ์ด๋ž‘ ๋ฐค ๊ผญ ๋จน์–ด๋ด์•ผ๊ฒ ์–ด์š”๐Ÿ˜™
  • Label
๊ฑฐ์ œ ๋‚ด๋ ค๊ฐ€๋Š” ๊ธธ์— ํœด๊ฒŒ์†Œ๋ฅผ ๋“ค๋ ธ๋Š”๋ฐ ์ƒˆ๋กœ ์ƒ๊ฒผ๋‚˜๋ณด๋”๋ผ๊ตฌ์š”!?
๋‚จํŽธ๊ณผ ์ €, ๋‘˜ ๋‹ค ๋นต๋Ÿฌ๋ฒ„๋ผ ์ง€๋‚˜์น  ์ˆ˜ ์—†์–ด ๊ตฌ๋งคํ•ด ๋จน์–ด๋ดค๋‹ต๋‹ˆ๋‹น๐Ÿ˜Š
๋ณด์„ฑ๋…น์ฐจํœด๊ฒŒ์†Œ ์•ˆ์œผ๋กœ ๋“ค์–ด์˜ค์‹œ๋ฉด ๋”ฑ ๊ฐ€์šด๋ฐ ์œ„์น˜ํ•ด ์žˆ์–ด์š”ใ…Žใ…Ž
๊ทธ๋ž˜์„œ ์–ด๋Š ๋ฌธ์œผ๋กœ๋ผ๋„ ๋“ค์–ด์˜ค์…”๋„ ๊ฐ€๊น๋‹ต๋‹ˆ๋‹ค๐Ÿ˜‰
๋ฉ”๋‰ดํŒ์„ ์ด๋ ‡๊ณ , ๊ฐ€๊ฒฉ์€ 2000์›~3000์› ์‚ฌ์ด์— ํ˜•์„ฑ ๋˜์–ด ์žˆ์–ด์š”!
์ด๋Ÿฐ๊ฑฐ ํ•˜๋‚˜ํ•˜๋‚˜ ๋ง›๋ณด๋Š”๊ฑฐ ๋„ˆ๋ฌด ์ข‹์•„ํ•˜๋Š”๋ฐ... ์ง„์ •ํ•˜๊ณ  ์†Œ๋ฏธ๋ฏธ ๋‹จํŒฅ๋นต ํ•˜๋‚˜, ์˜ฅ์ˆ˜์ˆ˜ ์น˜์ฆˆ๋นต ํ•˜๋‚˜, ๊ตฌ๋ฆฌ๋ณผ ํ•˜๋‚˜ ๊ณจ๋ž์Šต๋‹ˆ๋‹ค!
๋‹ค์Œ์— ๊ฐ€๋ฉด ๊ฐ•๋‚ญ์ฝฉ์ด๋ž‘ ๋ฐค ๊ผญ ๋จน์–ด๋ด์•ผ๊ฒ ์–ด์š”๐Ÿ˜™
  • Source

https://hi-e2e2.tistory.com/193

  • Output texts
Baseline:

๊ฑฐ์ œ ๋‚ด๋ ค๊ฐ€๋Š” ๊ธธ์— ํœด๊ฒŒ์†Œ๋ฅผ ๋“ค๋ ธ๋Š”๋ฐ ์ƒˆ๋กœ ์ƒ๊ฒผ๋‚˜๋ณด๋”๋ผ๊ตฌ์š”!?
๋‚จํŽธ๊ณผ ์ €, ๋‘˜ ๋‹ค ๋นต๋Ÿฌ๋ฒ„๋ผ ์ง€๋‚˜์น  ์ˆ˜ ์—†์–ด ๊ตฌ๋งคํ•ด ๋จน์–ด๋ดค๋‹ต๋‹ˆ๋‹น๐Ÿ˜Š ๋ณด์„ฑ๋…น์ฐจํœด๊ฒŒ์†Œ ์•ˆ์œผ๋กœ ๋“ค์–ด์˜ค์‹œ๋ฉด ๋”ฑ ๊ฐ€์šด๋ฐ ์œ„์น˜ํ•ด ์žˆ์–ด์š”ใ…Žใ…Ž ๊ทธ๋ž˜์„œ ์–ด๋Š ๋ฌธ์œผ๋กœ๋ผ๋„ ๋“ค์–ด์˜ค์…”๋„ ๊ฐ€๊น๋‹ต๋‹ˆ๋‹ค๐Ÿ˜‰ ๋ฉ”๋‰ดํŒ์„ ์ด๋ ‡๊ณ , ๊ฐ€๊ฒฉ์€ 2000์›~3000์› ์‚ฌ์ด์— ํ˜•์„ฑ ๋˜์–ด ์žˆ์–ด์š”!
์ด๋Ÿฐ๊ฑฐ ํ•˜๋‚˜ํ•˜๋‚˜ ๋ง›๋ณด๋Š”๊ฑฐ ๋„ˆ๋ฌด ์ข‹์•„ํ•˜๋Š”๋ฐ...
์ง„์ •ํ•˜๊ณ  ์†Œ๋ฏธ๋ฏธ ๋‹จํŒฅ๋นต ํ•˜๋‚˜, ์˜ฅ์ˆ˜์ˆ˜ ์น˜์ฆˆ๋นต ํ•˜๋‚˜, ๊ตฌ๋ฆฌ๋ณผ ํ•˜๋‚˜ ๊ณจ๋ž์Šต๋‹ˆ๋‹ค!
๋‹ค์Œ์— ๊ฐ€๋ฉด ๊ฐ•๋‚ญ์ฝฉ์ด๋ž‘ ๋ฐค ๊ผญ ๋จน์–ด๋ด์•ผ๊ฒ ์–ด์š”๐Ÿ˜™

The baseline separates the input text into 5 segments. The first sentence was separated well because it ends with final symbols. However, since such final symbols do not appear after the first sentence, the remaining sentences were not separated well.

Koalanlp (KKMA):

๊ฑฐ์ œ ๋‚ด๋ ค๊ฐ€๋Š” ๊ธธ์— ํœด๊ฒŒ ์†Œ๋ฅผ ๋“ค๋ ธ๋Š”๋ฐ ์ƒˆ๋กœ ์ƒ๊ฒผ๋‚˜
๋ณด๋”๋ผ๊ตฌ์š”!?
๋‚จํŽธ๊ณผ ์ €, ๋‘˜ ๋‹ค ๋นต ๋Ÿฌ๋ฒ„๋ผ ์ง€๋‚˜์น  ์ˆ˜ ์—†์–ด ๊ตฌ๋งคํ•ด ๋จน์–ด ๋ดค๋‹ต๋‹ˆ๋‹น
๐Ÿ˜Š ๋ณด์„ฑ ๋…น์ฐจ ํœด๊ฒŒ์†Œ ์•ˆ์œผ๋กœ ๋“ค์–ด์˜ค์‹œ๋ฉด ๋”ฑ ๊ฐ€์šด๋ฐ ์œ„์น˜ํ•ด ์žˆ์–ด์š”
ใ…Žใ…Ž ๊ทธ๋ž˜์„œ ์–ด๋Š ๋ฌธ์œผ๋กœ ๋ผ๋„ ๋“ค์–ด์˜ค์…”๋„ ๊ฐ€๊น๋‹ต๋‹ˆ๋‹ค
๐Ÿ˜‰ ๋ฉ”๋‰ดํŒ์„ ์ด๋ ‡๊ณ , ๊ฐ€๊ฒฉ์€ 2000์› ~3000 ์› ์‚ฌ์ด์— ํ˜•์„ฑ ๋˜์–ด ์žˆ์–ด์š”!
์ด๋Ÿฐ ๊ฑฐ ํ•˜๋‚˜ํ•˜๋‚˜ ๋ง›๋ณด๋Š” ๊ฑฐ ๋„ˆ๋ฌด ์ข‹์•„ํ•˜๋Š”๋ฐ... ์ง„์ •ํ•˜๊ณ  ์†Œ๋ฏธ ๋ฏธ ๋‹จํŒฅ๋นต ํ•˜๋‚˜, ์˜ฅ์ˆ˜์ˆ˜ ์น˜์ฆˆ ๋นต ํ•˜๋‚˜, ๊ตฌ๋ฆฌ ๋ณผ ํ•˜๋‚˜ ๊ณจ๋ž์Šต๋‹ˆ๋‹ค!
๋‹ค์Œ์— ๊ฐ€๋ฉด ๊ฐ•๋‚ญ์ฝฉ์ด๋ž‘ ๋ฐค ๊ผญ ๋จน์–ด๋ด์•ผ๊ฒ ์–ด์š”๐Ÿ˜™

Koalanlp splits sentences better than the baseline because it uses morphological information; it splits the input text into 8 sentences in total. But many mispartitions remain. The first thing that catches the eye is the immature emoji handling: people usually put emojis at the end of a sentence, and in that case the emojis should be included in the sentence. The second is the mispartition between ์ƒ๊ฒผ๋‚˜ and ๋ณด๋”๋ผ๊ตฌ์š”!?. Probably the KKMA morpheme analyzer recognized ์ƒ๊ฒผ๋‚˜ as a final eomi (์ข…๊ฒฐ์–ด๋ฏธ), but it is actually a connecting eomi (์—ฐ๊ฒฐ์–ด๋ฏธ). This comes down to the performance of the morpheme analyzer; the baseline is actually a little safer in this area.

Kiwi:

๊ฑฐ์ œ ๋‚ด๋ ค๊ฐ€๋Š” ๊ธธ์— ํœด๊ฒŒ์†Œ๋ฅผ ๋“ค๋ ธ๋Š”๋ฐ ์ƒˆ๋กœ ์ƒ๊ฒผ๋‚˜๋ณด๋”๋ผ๊ตฌ์š”!?
๋‚จํŽธ๊ณผ ์ €, ๋‘˜ ๋‹ค ๋นต๋Ÿฌ๋ฒ„๋ผ ์ง€๋‚˜์น  ์ˆ˜ ์—†์–ด ๊ตฌ๋งคํ•ด ๋จน์–ด๋ดค๋‹ต๋‹ˆ๋‹น๐Ÿ˜Š
๋ณด์„ฑ๋…น์ฐจํœด๊ฒŒ์†Œ ์•ˆ์œผ๋กœ ๋“ค์–ด์˜ค์‹œ๋ฉด ๋”ฑ ๊ฐ€์šด๋ฐ ์œ„์น˜ํ•ด ์žˆ์–ด์š”ใ…Žใ…Ž
๊ทธ๋ž˜์„œ ์–ด๋Š ๋ฌธ์œผ๋กœ๋ผ๋„ ๋“ค์–ด์˜ค์…”๋„ ๊ฐ€๊น๋‹ต๋‹ˆ๋‹ค๐Ÿ˜‰ ๋ฉ”๋‰ดํŒ์„ ์ด๋ ‡๊ณ , ๊ฐ€๊ฒฉ์€ 2000์›~3000์› ์‚ฌ์ด์— ํ˜•์„ฑ ๋˜์–ด ์žˆ์–ด์š”!
์ด๋Ÿฐ๊ฑฐ ํ•˜๋‚˜ํ•˜๋‚˜ ๋ง›๋ณด๋Š”๊ฑฐ ๋„ˆ๋ฌด ์ข‹์•„ํ•˜๋Š”๋ฐ...
์ง„์ •ํ•˜๊ณ  ์†Œ๋ฏธ๋ฏธ ๋‹จํŒฅ๋นต ํ•˜๋‚˜, ์˜ฅ์ˆ˜์ˆ˜ ์น˜์ฆˆ๋นต ํ•˜๋‚˜, ๊ตฌ๋ฆฌ๋ณผ ํ•˜๋‚˜ ๊ณจ๋ž์Šต๋‹ˆ๋‹ค!
๋‹ค์Œ์— ๊ฐ€๋ฉด ๊ฐ•๋‚ญ์ฝฉ์ด๋ž‘ ๋ฐค ๊ผญ ๋จน์–ด๋ด์•ผ๊ฒ ์–ด์š”๐Ÿ˜™

Kiwi shows better performance than Koalanlp, splitting the input text into 7 sentences. Most sentences are segmented well, but it does not split between ๊ฐ€๊น๋‹ต๋‹ˆ๋‹ค๐Ÿ˜‰ and ๋ฉ”๋‰ดํŒ์„. It also separates ์ข‹์•„ํ•˜๋Š”๋ฐ... from ์ง„์ •ํ•˜๊ณ . This part could be seen as an independent sentence depending on the reader, but the author of the original article wrote it not as an independent sentence but as an embraced sentence (์•ˆ๊ธด๋ฌธ์žฅ).

The original article was written like:

Kss (mecab):

๊ฑฐ์ œ ๋‚ด๋ ค๊ฐ€๋Š” ๊ธธ์— ํœด๊ฒŒ์†Œ๋ฅผ ๋“ค๋ ธ๋Š”๋ฐ ์ƒˆ๋กœ ์ƒ๊ฒผ๋‚˜๋ณด๋”๋ผ๊ตฌ์š”!?
๋‚จํŽธ๊ณผ ์ €, ๋‘˜ ๋‹ค ๋นต๋Ÿฌ๋ฒ„๋ผ ์ง€๋‚˜์น  ์ˆ˜ ์—†์–ด ๊ตฌ๋งคํ•ด ๋จน์–ด๋ดค๋‹ต๋‹ˆ๋‹น๐Ÿ˜Š
๋ณด์„ฑ๋…น์ฐจํœด๊ฒŒ์†Œ ์•ˆ์œผ๋กœ ๋“ค์–ด์˜ค์‹œ๋ฉด ๋”ฑ ๊ฐ€์šด๋ฐ ์œ„์น˜ํ•ด ์žˆ์–ด์š”ใ…Žใ…Ž
๊ทธ๋ž˜์„œ ์–ด๋Š ๋ฌธ์œผ๋กœ๋ผ๋„ ๋“ค์–ด์˜ค์…”๋„ ๊ฐ€๊น๋‹ต๋‹ˆ๋‹ค๐Ÿ˜‰
๋ฉ”๋‰ดํŒ์„ ์ด๋ ‡๊ณ , ๊ฐ€๊ฒฉ์€ 2000์›~3000์› ์‚ฌ์ด์— ํ˜•์„ฑ ๋˜์–ด ์žˆ์–ด์š”!
์ด๋Ÿฐ๊ฑฐ ํ•˜๋‚˜ํ•˜๋‚˜ ๋ง›๋ณด๋Š”๊ฑฐ ๋„ˆ๋ฌด ์ข‹์•„ํ•˜๋Š”๋ฐ... ์ง„์ •ํ•˜๊ณ  ์†Œ๋ฏธ๋ฏธ ๋‹จํŒฅ๋นต ํ•˜๋‚˜, ์˜ฅ์ˆ˜์ˆ˜ ์น˜์ฆˆ๋นต ํ•˜๋‚˜, ๊ตฌ๋ฆฌ๋ณผ ํ•˜๋‚˜ ๊ณจ๋ž์Šต๋‹ˆ๋‹ค!
๋‹ค์Œ์— ๊ฐ€๋ฉด ๊ฐ•๋‚ญ์ฝฉ์ด๋ž‘ ๋ฐค ๊ผญ ๋จน์–ด๋ด์•ผ๊ฒ ์–ด์š”๐Ÿ˜™

The result of Kss is the same as the gold label. In particular, it successfully separates ๊ฐ€๊น๋‹ต๋‹ˆ๋‹ค๐Ÿ˜‰ and ๋ฉ”๋‰ดํŒ์„. This part is in fact a final eomi (์ข…๊ฒฐ์–ด๋ฏธ), but many morpheme analyzers confuse final eomis (์ข…๊ฒฐ์–ด๋ฏธ) with connecting eomis (์—ฐ๊ฒฐ์–ด๋ฏธ); the mecab and pecab morpheme analyzers, which are backends of Kss, also recognize this part as a connecting eomi (์—ฐ๊ฒฐ์–ด๋ฏธ). For this reason, Kss has a feature that detects wrongly recognized connecting eomis (์—ฐ๊ฒฐ์–ด๋ฏธ) and corrects them, so it is able to separate this part effectively. Next, Kss does not split ์ข‹์•„ํ•˜๋Š”๋ฐ... and ์ง„์ •ํ•˜๊ณ , because ์ข‹์•„ํ•˜๋Š”๋ฐ... is not an independent sentence but an embraced sentence (์•ˆ๊ธด๋ฌธ์žฅ). This means that, unlike the baseline, Kss does not split a sentence simply because a . appears. In most cases a . delimits sentences, but there are many exceptions.

Example 2

  • Input text
์–ด๋Šํ™”์ฐฝํ•œ๋‚  ์ถœ๊ทผ์ „์— ๋„ˆ๋ฌด์ผ์ฐ์ผ์–ด๋‚˜ ๋ฒ„๋ ธ์Œ (์ถœ๊ทผ์‹œ๊ฐ„ 19์‹œ) ํ• ๊บผ๋„์—†๊ณ ํ•ด์„œ ์นดํŽ˜๋ฅผ ์ฐพ์•„ ์‹œ๋‚ด๋กœ ๋‚˜๊ฐ”์Œ ์ƒˆ๋กœ์ƒ๊ธด๊ณณ์— ์‚ฌ์žฅ๋‹˜์ด ์ปคํ”ผ์„ ์ˆ˜์ธ์ง€ ์ปคํ”ผ๋ฐ•์‚ฌ๋ผ๊ณ  ํ•ด์„œ ๊ฐ”์Œ ์˜คํ”ˆํ•œ์ง€ ์–ผ๋งˆ์•ˆ๋˜์„œ ๊ทธ๋Ÿฐ์ง€ ์†๋‹˜์ด ์–ผ๋งˆ์—†์—ˆ์Œ ์กฐ์šฉํ•˜๊ณ  ์ข‹๋‹ค๋ฉฐ ์ข‹์•„ํ•˜๋Š”๊ฑธ์‹œ์ผœ์„œ ํ…Œ๋ผ์Šค์— ์•‰์Œ ๊ทผ๋ฐ ์กฐ์šฉํ•˜๋˜ ์นดํŽ˜๊ฐ€ ์‚ฐ๋งŒํ•ด์ง ์†Œ๋ฆฌ์˜ ์ถœ์ฒ˜๋Š” ์นด์šดํ„ฐ์˜€์Œ(ํ…Œ๋ผ์Šค๊ฐ€ ์นด์šดํ„ฐ ๋ฐ”๋กœ์˜†) ๋“ค์„๋ผ๊ณ  ๋“ค์€๊ฒŒ ์•„๋‹ˆ๋ผ ๊ท€๋Š” ์—ด๋ ค์žˆ์œผ๋‹ˆ ๋“ฃ๊ฒŒ๋œ ๋Œ€์‚ฌ.
  • Label
์–ด๋Šํ™”์ฐฝํ•œ๋‚  ์ถœ๊ทผ์ „์— ๋„ˆ๋ฌด์ผ์ฐ์ผ์–ด๋‚˜ ๋ฒ„๋ ธ์Œ (์ถœ๊ทผ์‹œ๊ฐ„ 19์‹œ)
ํ• ๊บผ๋„์—†๊ณ ํ•ด์„œ ์นดํŽ˜๋ฅผ ์ฐพ์•„ ์‹œ๋‚ด๋กœ ๋‚˜๊ฐ”์Œ
์ƒˆ๋กœ์ƒ๊ธด๊ณณ์— ์‚ฌ์žฅ๋‹˜์ด ์ปคํ”ผ์„ ์ˆ˜์ธ์ง€ ์ปคํ”ผ๋ฐ•์‚ฌ๋ผ๊ณ  ํ•ด์„œ ๊ฐ”์Œ
์˜คํ”ˆํ•œ์ง€ ์–ผ๋งˆ์•ˆ๋˜์„œ ๊ทธ๋Ÿฐ์ง€ ์†๋‹˜์ด ์–ผ๋งˆ์—†์—ˆ์Œ
์กฐ์šฉํ•˜๊ณ  ์ข‹๋‹ค๋ฉฐ ์ข‹์•„ํ•˜๋Š”๊ฑธ์‹œ์ผœ์„œ ํ…Œ๋ผ์Šค์— ์•‰์Œ
๊ทผ๋ฐ ์กฐ์šฉํ•˜๋˜ ์นดํŽ˜๊ฐ€ ์‚ฐ๋งŒํ•ด์ง
์†Œ๋ฆฌ์˜ ์ถœ์ฒ˜๋Š” ์นด์šดํ„ฐ์˜€์Œ(ํ…Œ๋ผ์Šค๊ฐ€ ์นด์šดํ„ฐ ๋ฐ”๋กœ์˜†)
๋“ค์„๋ผ๊ณ  ๋“ค์€๊ฒŒ ์•„๋‹ˆ๋ผ ๊ท€๋Š” ์—ด๋ ค์žˆ์œผ๋‹ˆ ๋“ฃ๊ฒŒ๋œ ๋Œ€์‚ฌ.
  • Source

https://mrsign92.tistory.com/6099371

  • Output texts
Baseline:

์–ด๋Šํ™”์ฐฝํ•œ๋‚  ์ถœ๊ทผ์ „์— ๋„ˆ๋ฌด์ผ์ฐ์ผ์–ด๋‚˜ ๋ฒ„๋ ธ์Œ (์ถœ๊ทผ์‹œ๊ฐ„ 19์‹œ) ํ• ๊บผ๋„์—†๊ณ ํ•ด์„œ ์นดํŽ˜๋ฅผ ์ฐพ์•„ ์‹œ๋‚ด๋กœ ๋‚˜๊ฐ”์Œ ์ƒˆ๋กœ์ƒ๊ธด๊ณณ์— ์‚ฌ์žฅ๋‹˜์ด ์ปคํ”ผ์„ ์ˆ˜์ธ์ง€ ์ปคํ”ผ๋ฐ•์‚ฌ๋ผ๊ณ  ํ•ด์„œ ๊ฐ”์Œ ์˜คํ”ˆํ•œ์ง€ ์–ผ๋งˆ์•ˆ๋˜์„œ ๊ทธ๋Ÿฐ์ง€ ์†๋‹˜์ด ์–ผ๋งˆ์—†์—ˆ์Œ ์กฐ์šฉํ•˜๊ณ  ์ข‹๋‹ค๋ฉฐ ์ข‹์•„ํ•˜๋Š”๊ฑธ์‹œ์ผœ์„œ ํ…Œ๋ผ์Šค์— ์•‰์Œ ๊ทผ๋ฐ ์กฐ์šฉํ•˜๋˜ ์นดํŽ˜๊ฐ€ ์‚ฐ๋งŒํ•ด์ง ์†Œ๋ฆฌ์˜ ์ถœ์ฒ˜๋Š” ์นด์šดํ„ฐ์˜€์Œ(ํ…Œ๋ผ์Šค๊ฐ€ ์นด์šดํ„ฐ ๋ฐ”๋กœ์˜†) ๋“ค์„๋ผ๊ณ  ๋“ค์€๊ฒŒ ์•„๋‹ˆ๋ผ ๊ท€๋Š” ์—ด๋ ค์žˆ์œผ๋‹ˆ ๋“ฃ๊ฒŒ๋œ ๋Œ€์‚ฌ.

The baseline doesn't split any sentences because no sentence-final punctuation (., !, ?) followed by whitespace appears in the input text.

Koalanlp (KKMA)

์–ด๋Š ํ™”์ฐฝํ•œ ๋‚  ์ถœ๊ทผ ์ „์— ๋„ˆ๋ฌด ์ผ์ฐ ์ผ์–ด๋‚˜ ๋ฒ„๋ ธ์Œ ( ์ถœ๊ทผ์‹œ๊ฐ„ 19์‹œ) ํ•  ๊บผ๋„ ์—†๊ณ  ํ•ด์„œ ์นดํŽ˜๋ฅผ ์ฐพ์•„ ์‹œ๋‚ด๋กœ ๋‚˜๊ฐ”์Œ ์ƒˆ๋กœ ์ƒ๊ธด ๊ณณ์— ์‚ฌ์žฅ๋‹˜์ด ์ปคํ”ผ์„ ์ˆ˜์ธ์ง€ ์ปคํ”ผ๋ฐ•์‚ฌ๋ผ๊ณ  ํ•ด์„œ ๊ฐ”์Œ ์˜คํ”ˆํ•œ์ง€ ์–ผ๋งˆ ์•ˆ ๋˜ ์„œ ๊ทธ๋Ÿฐ์ง€ ์†๋‹˜์ด ์–ผ๋งˆ ์—†์—ˆ์Œ ์กฐ์šฉํ•˜๊ณ  ์ข‹๋‹ค๋ฉฐ ์ข‹์•„ํ•˜๋Š” ๊ฑธ ์‹œ์ผœ์„œ ํ…Œ๋ผ์Šค์— ์•‰์Œ ๊ทผ๋ฐ ์กฐ์šฉํ•˜๋˜ ์นดํŽ˜๊ฐ€ ์‚ฐ๋งŒ ํ•ด์ง ์†Œ๋ฆฌ์˜ ์ถœ์ฒ˜๋Š” ์นด์šดํ„ฐ์˜€์Œ( ํ…Œ๋ผ์Šค๊ฐ€ ์นด์šดํ„ฐ ๋ฐ”๋กœ ์˜†) ๋“ค์„๋ผ๊ณ 
๋“ค์€ ๊ฒŒ ์•„๋‹ˆ๋ผ ๊ท€๋Š” ์—ด๋ ค ์žˆ์œผ๋‹ˆ ๋“ฃ๊ฒŒ ๋œ ๋Œ€์‚ฌ.

Koalanlp separates ๋“ค์„๋ผ๊ณ  and ๋“ค์€, but that is not a correct split point. It also does not seem to consider the predicative use of eomi transferred from noun (๋ช…์‚ฌํ˜• ์ „์„ฑ์–ด๋ฏธ์˜ ์„œ์ˆ ์  ์šฉ๋ฒ•).

Kiwi

์–ด๋Šํ™”์ฐฝํ•œ๋‚  ์ถœ๊ทผ์ „์— ๋„ˆ๋ฌด์ผ์ฐ์ผ์–ด๋‚˜ ๋ฒ„๋ ธ์Œ (์ถœ๊ทผ์‹œ๊ฐ„ 19์‹œ) ํ• ๊บผ๋„์—†๊ณ ํ•ด์„œ ์นดํŽ˜๋ฅผ ์ฐพ์•„ ์‹œ๋‚ด๋กœ ๋‚˜๊ฐ”์Œ ์ƒˆ๋กœ์ƒ๊ธด๊ณณ์— ์‚ฌ์žฅ๋‹˜์ด ์ปคํ”ผ์„ ์ˆ˜์ธ์ง€ ์ปคํ”ผ๋ฐ•์‚ฌ๋ผ๊ณ  ํ•ด์„œ ๊ฐ”์Œ ์˜คํ”ˆํ•œ์ง€ ์–ผ๋งˆ์•ˆ๋˜์„œ ๊ทธ๋Ÿฐ์ง€ ์†๋‹˜์ด ์–ผ๋งˆ์—†์—ˆ์Œ ์กฐ์šฉํ•˜๊ณ  ์ข‹๋‹ค๋ฉฐ ์ข‹์•„ํ•˜๋Š”๊ฑธ์‹œ์ผœ์„œ ํ…Œ๋ผ์Šค์— ์•‰์Œ ๊ทผ๋ฐ ์กฐ์šฉํ•˜๋˜ ์นดํŽ˜๊ฐ€ ์‚ฐ๋งŒํ•ด์ง ์†Œ๋ฆฌ์˜ ์ถœ์ฒ˜๋Š” ์นด์šดํ„ฐ์˜€์Œ(ํ…Œ๋ผ์Šค๊ฐ€ ์นด์šดํ„ฐ ๋ฐ”๋กœ์˜†) ๋“ค์„๋ผ๊ณ  ๋“ค์€๊ฒŒ ์•„๋‹ˆ๋ผ ๊ท€๋Š” ์—ด๋ ค์žˆ์œผ๋‹ˆ ๋“ฃ๊ฒŒ๋œ ๋Œ€์‚ฌ.

Kiwi doesn't separate any sentences, similar to the baseline. Likewise, it does not consider the predicative use of eomi transferred from noun (๋ช…์‚ฌํ˜• ์ „์„ฑ์–ด๋ฏธ์˜ ์„œ์ˆ ์  ์šฉ๋ฒ•).

Kss (Mecab)

์–ด๋Šํ™”์ฐฝํ•œ๋‚  ์ถœ๊ทผ์ „์— ๋„ˆ๋ฌด์ผ์ฐ์ผ์–ด๋‚˜ ๋ฒ„๋ ธ์Œ (์ถœ๊ทผ์‹œ๊ฐ„ 19์‹œ)
ํ• ๊บผ๋„์—†๊ณ ํ•ด์„œ ์นดํŽ˜๋ฅผ ์ฐพ์•„ ์‹œ๋‚ด๋กœ ๋‚˜๊ฐ”์Œ
์ƒˆ๋กœ์ƒ๊ธด๊ณณ์— ์‚ฌ์žฅ๋‹˜์ด ์ปคํ”ผ์„ ์ˆ˜์ธ์ง€ ์ปคํ”ผ๋ฐ•์‚ฌ๋ผ๊ณ  ํ•ด์„œ ๊ฐ”์Œ
์˜คํ”ˆํ•œ์ง€ ์–ผ๋งˆ์•ˆ๋˜์„œ ๊ทธ๋Ÿฐ์ง€ ์†๋‹˜์ด ์–ผ๋งˆ์—†์—ˆ์Œ
์กฐ์šฉํ•˜๊ณ  ์ข‹๋‹ค๋ฉฐ ์ข‹์•„ํ•˜๋Š”๊ฑธ์‹œ์ผœ์„œ ํ…Œ๋ผ์Šค์— ์•‰์Œ
๊ทผ๋ฐ ์กฐ์šฉํ•˜๋˜ ์นดํŽ˜๊ฐ€ ์‚ฐ๋งŒํ•ด์ง ์†Œ๋ฆฌ์˜ ์ถœ์ฒ˜๋Š” ์นด์šดํ„ฐ์˜€์Œ(ํ…Œ๋ผ์Šค๊ฐ€ ์นด์šดํ„ฐ ๋ฐ”๋กœ์˜†)
๋“ค์„๋ผ๊ณ  ๋“ค์€๊ฒŒ ์•„๋‹ˆ๋ผ ๊ท€๋Š” ์—ด๋ ค์žˆ์œผ๋‹ˆ ๋“ฃ๊ฒŒ๋œ ๋Œ€์‚ฌ.

The result of Kss is very similar to the gold label because Kss considers the predicative use of eomi transferred from noun (๋ช…์‚ฌํ˜• ์ „์„ฑ์–ด๋ฏธ์˜ ์„œ์ˆ ์  ์šฉ๋ฒ•). However, Kss couldn't split between ์‚ฐ๋งŒํ•ด์ง and ์†Œ๋ฆฌ์˜. That is a correct split point, but it was blocked by one of the exceptions I built to prevent wrong segmentation. Splitting eomi transferred from noun (๋ช…์‚ฌํ˜• ์ „์„ฑ์–ด๋ฏธ) is an unsafe and difficult task, so Kss includes many exceptions to prevent wrong segmentation.

Example 3

  • Input text
์ฑ…์†Œ๊ฐœ์— ์ด๊ฑด ์†Œ์„ค์ธ๊ฐ€ ์‹ค์ œ์ธ๊ฐ€๋ผ๋Š” ๋ฌธ๊ตฌ๋ฅผ ๋ณด๊ณ  ์žฌ๋ฐŒ๊ฒ ๋‹ค ์‹ถ์–ด ๋ณด๊ฒŒ ๋˜์—ˆ๋‹ค. '๋ฐ”์นด๋ผ'๋ผ๋Š” ๋„๋ฐ•์€ 2์žฅ์˜ ์นด๋“œ ํ•ฉ์ด ๋†’์€ ์‚ฌ๋žŒ์ด ์ด๊ธฐ๋Š” ๊ฒŒ์ž„์œผ๋กœ ์•„์ฃผ ๋‹จ์ˆœํ•œ ๊ฒŒ์ž„์ด๋‹ค. ์ด๋Ÿฐ๊ฒŒ ์ค‘๋…์ด ๋˜๋‚˜? ์‹ถ์—ˆ๋Š”๋ฐ ์ด ์ฑ…์ด ๋ฐ”์นด๋ผ์™€ ๋น„์Šทํ•œ ๋งค๋ ฅ์ด ์žˆ๋‹ค ์ƒ๊ฐ๋“ค์—ˆ๋‹ค. ๋‚ด์šฉ์ด ์Šคํ”ผ๋“œํ•˜๊ฒŒ ์ง„ํ–‰๋˜๊ณ  ๋ง‰ํžˆ๋Š” ๊ตฌ๊ฐ„์—†์ด ์ฝํžˆ๋Š”๊ฒŒ ๋‚˜๋„ ๋ชจ๋ฅด๊ฒŒ ํŽ˜์ด์ง€๋ฅผ ์Šฅ์Šฅ ๋„˜๊ธฐ๊ณ  ์žˆ์—ˆ๋‹ค. ๋ฌผ๋ก  ์ฝ์Œ์œผ๋กœ์จ ํฐ ๋ˆ์„ ๋ฒŒ์ง„ ์•Š์ง€๋งŒ ์ด๋Ÿฐ ์Šคํ”ผ๋“œํ•จ์— ๋‚˜๋„ ๋ชจ๋ฅด๊ฒŒ ๊ณ„์† ๊ฒŒ์ž„์— ์ฐธ์—ฌํ•˜๊ฒŒ ๋˜๊ณ  ๋‚˜์˜ค๋Š” ํƒ€์ด๋ฐ์„ ์žก์ง€ ๋ชปํ•ด ๋น ์ง€์ง€ ์•Š์•˜์„๊นŒ? ๋ผ๋Š” ์ƒ๊ฐ์„ ํ•˜๊ฒŒ ๋๋‹ค. ์ด ์ฑ…์—์„œ ํ˜„์ง€์˜ ๊ฟˆ์€ ๊ฐ€๊ฒฉํ‘œ๋ฅผ ๋ณด์ง€ ์•Š๋Š” ์‚ถ์ด๋ผ ํ•œ๋‹ค. ์ด ๋ถ€๋ถ„์„ ์ฝ๊ณ  ๋‚˜๋ˆ๋ฐ! ๋ผ๋Š” ์ƒ๊ฐํ•˜๋ฉด์„œ ์ˆœ๊ฐ„ ๋„๋ฐ•์ด๋ผ๋Š”๊ฑธ๋กœ๋ผ๋„ ๋ˆ์„ ๋งŽ์ด ๋ฒŒ์—ˆ๋˜ ํ˜„์ง€๊ฐ€ ๋ถ€๋Ÿฌ์› ๋‹ค. ๊ทธ๋Ÿฌ๋ฉด์„œ ๋‚ด๊ฐ€ ๋„๋ฐ•์„ ํ–ˆ๋‹ค๋ฉด?๋ผ๋Š” ์ƒ์ƒ์„ ํ•ด๋ดค๋‹ค. ๊ทธ๋ฆฌ๊ณ  ์ด๋Ÿฐ ์ƒ์ƒ์„ ํ•  ์ˆ˜ ์žˆ๊ฒŒ ๋งŒ๋“ค์–ด์ค˜์„œ ์ด ์ฑ…์ด ๋” ์žฌ๋ฐŒ๊ฒŒ ๋‹ค๊ฐ€์™”๋‹ค. ์ผ์ƒ์— ์ง€๋ฃจํ•จ์„ ๋Š๊ปด ๋„๋ฐ•๊ฐ™์€ ์‚ถ์„ ์‚ด๊ณ ์‹ถ๋‹ค๋ฉด ๋„๋ฐ•ํ•˜์ง€๋ง๊ณ  ์ฐจ๋ผ๋ฆฌ ์ด ์ฑ…์„ ๋ณด๊ธธ^^ใ…‹ 
  • Label
์ฑ…์†Œ๊ฐœ์— ์ด๊ฑด ์†Œ์„ค์ธ๊ฐ€ ์‹ค์ œ์ธ๊ฐ€๋ผ๋Š” ๋ฌธ๊ตฌ๋ฅผ ๋ณด๊ณ  ์žฌ๋ฐŒ๊ฒ ๋‹ค ์‹ถ์–ด ๋ณด๊ฒŒ ๋˜์—ˆ๋‹ค.
'๋ฐ”์นด๋ผ'๋ผ๋Š” ๋„๋ฐ•์€ 2์žฅ์˜ ์นด๋“œ ํ•ฉ์ด ๋†’์€ ์‚ฌ๋žŒ์ด ์ด๊ธฐ๋Š” ๊ฒŒ์ž„์œผ๋กœ ์•„์ฃผ ๋‹จ์ˆœํ•œ ๊ฒŒ์ž„์ด๋‹ค.
์ด๋Ÿฐ๊ฒŒ ์ค‘๋…์ด ๋˜๋‚˜? ์‹ถ์—ˆ๋Š”๋ฐ ์ด ์ฑ…์ด ๋ฐ”์นด๋ผ์™€ ๋น„์Šทํ•œ ๋งค๋ ฅ์ด ์žˆ๋‹ค ์ƒ๊ฐ๋“ค์—ˆ๋‹ค.
๋‚ด์šฉ์ด ์Šคํ”ผ๋“œํ•˜๊ฒŒ ์ง„ํ–‰๋˜๊ณ  ๋ง‰ํžˆ๋Š” ๊ตฌ๊ฐ„์—†์ด ์ฝํžˆ๋Š”๊ฒŒ ๋‚˜๋„ ๋ชจ๋ฅด๊ฒŒ ํŽ˜์ด์ง€๋ฅผ ์Šฅ์Šฅ ๋„˜๊ธฐ๊ณ  ์žˆ์—ˆ๋‹ค.
๋ฌผ๋ก  ์ฝ์Œ์œผ๋กœ์จ ํฐ ๋ˆ์„ ๋ฒŒ์ง„ ์•Š์ง€๋งŒ ์ด๋Ÿฐ ์Šคํ”ผ๋“œํ•จ์— ๋‚˜๋„ ๋ชจ๋ฅด๊ฒŒ ๊ณ„์† ๊ฒŒ์ž„์— ์ฐธ์—ฌํ•˜๊ฒŒ ๋˜๊ณ  ๋‚˜์˜ค๋Š” ํƒ€์ด๋ฐ์„ ์žก์ง€ ๋ชปํ•ด ๋น ์ง€์ง€ ์•Š์•˜์„๊นŒ? ๋ผ๋Š” ์ƒ๊ฐ์„ ํ•˜๊ฒŒ ๋๋‹ค.
์ด ์ฑ…์—์„œ ํ˜„์ง€์˜ ๊ฟˆ์€ ๊ฐ€๊ฒฉํ‘œ๋ฅผ ๋ณด์ง€ ์•Š๋Š” ์‚ถ์ด๋ผ ํ•œ๋‹ค.
์ด ๋ถ€๋ถ„์„ ์ฝ๊ณ  ๋‚˜๋ˆ๋ฐ! ๋ผ๋Š” ์ƒ๊ฐํ•˜๋ฉด์„œ ์ˆœ๊ฐ„ ๋„๋ฐ•์ด๋ผ๋Š”๊ฑธ๋กœ๋ผ๋„ ๋ˆ์„ ๋งŽ์ด ๋ฒŒ์—ˆ๋˜ ํ˜„์ง€๊ฐ€ ๋ถ€๋Ÿฌ์› ๋‹ค.
๊ทธ๋Ÿฌ๋ฉด์„œ ๋‚ด๊ฐ€ ๋„๋ฐ•์„ ํ–ˆ๋‹ค๋ฉด?๋ผ๋Š” ์ƒ์ƒ์„ ํ•ด๋ดค๋‹ค.
๊ทธ๋ฆฌ๊ณ  ์ด๋Ÿฐ ์ƒ์ƒ์„ ํ•  ์ˆ˜ ์žˆ๊ฒŒ ๋งŒ๋“ค์–ด์ค˜์„œ ์ด ์ฑ…์ด ๋” ์žฌ๋ฐŒ๊ฒŒ ๋‹ค๊ฐ€์™”๋‹ค.
์ผ์ƒ์— ์ง€๋ฃจํ•จ์„ ๋Š๊ปด ๋„๋ฐ•๊ฐ™์€ ์‚ถ์„ ์‚ด๊ณ ์‹ถ๋‹ค๋ฉด ๋„๋ฐ•ํ•˜์ง€๋ง๊ณ  ์ฐจ๋ผ๋ฆฌ ์ด ์ฑ…์„ ๋ณด๊ธธ^^ใ…‹ 
  • Source

https://hi-e2e2.tistory.com/63

  • Output texts
Baseline:

์ฑ…์†Œ๊ฐœ์— ์ด๊ฑด ์†Œ์„ค์ธ๊ฐ€ ์‹ค์ œ์ธ๊ฐ€๋ผ๋Š” ๋ฌธ๊ตฌ๋ฅผ ๋ณด๊ณ  ์žฌ๋ฐŒ๊ฒ ๋‹ค ์‹ถ์–ด ๋ณด๊ฒŒ ๋˜์—ˆ๋‹ค.
'๋ฐ”์นด๋ผ'๋ผ๋Š” ๋„๋ฐ•์€ 2์žฅ์˜ ์นด๋“œ ํ•ฉ์ด ๋†’์€ ์‚ฌ๋žŒ์ด ์ด๊ธฐ๋Š” ๊ฒŒ์ž„์œผ๋กœ ์•„์ฃผ ๋‹จ์ˆœํ•œ ๊ฒŒ์ž„์ด๋‹ค.
์ด๋Ÿฐ๊ฒŒ ์ค‘๋…์ด ๋˜๋‚˜?
์‹ถ์—ˆ๋Š”๋ฐ ์ด ์ฑ…์ด ๋ฐ”์นด๋ผ์™€ ๋น„์Šทํ•œ ๋งค๋ ฅ์ด ์žˆ๋‹ค ์ƒ๊ฐ๋“ค์—ˆ๋‹ค.
๋‚ด์šฉ์ด ์Šคํ”ผ๋“œํ•˜๊ฒŒ ์ง„ํ–‰๋˜๊ณ  ๋ง‰ํžˆ๋Š” ๊ตฌ๊ฐ„์—†์ด ์ฝํžˆ๋Š”๊ฒŒ ๋‚˜๋„ ๋ชจ๋ฅด๊ฒŒ ํŽ˜์ด์ง€๋ฅผ ์Šฅ์Šฅ ๋„˜๊ธฐ๊ณ  ์žˆ์—ˆ๋‹ค.
๋ฌผ๋ก  ์ฝ์Œ์œผ๋กœ์จ ํฐ ๋ˆ์„ ๋ฒŒ์ง„ ์•Š์ง€๋งŒ ์ด๋Ÿฐ ์Šคํ”ผ๋“œํ•จ์— ๋‚˜๋„ ๋ชจ๋ฅด๊ฒŒ ๊ณ„์† ๊ฒŒ์ž„์— ์ฐธ์—ฌํ•˜๊ฒŒ ๋˜๊ณ  ๋‚˜์˜ค๋Š” ํƒ€์ด๋ฐ์„ ์žก์ง€ ๋ชปํ•ด ๋น ์ง€์ง€ ์•Š์•˜์„๊นŒ?
๋ผ๋Š” ์ƒ๊ฐ์„ ํ•˜๊ฒŒ ๋๋‹ค.
์ด ์ฑ…์—์„œ ํ˜„์ง€์˜ ๊ฟˆ์€ ๊ฐ€๊ฒฉํ‘œ๋ฅผ ๋ณด์ง€ ์•Š๋Š” ์‚ถ์ด๋ผ ํ•œ๋‹ค.
์ด ๋ถ€๋ถ„์„ ์ฝ๊ณ  ๋‚˜๋ˆ๋ฐ!
๋ผ๋Š” ์ƒ๊ฐํ•˜๋ฉด์„œ ์ˆœ๊ฐ„ ๋„๋ฐ•์ด๋ผ๋Š”๊ฑธ๋กœ๋ผ๋„ ๋ˆ์„ ๋งŽ์ด ๋ฒŒ์—ˆ๋˜ ํ˜„์ง€๊ฐ€ ๋ถ€๋Ÿฌ์› ๋‹ค.
๊ทธ๋Ÿฌ๋ฉด์„œ ๋‚ด๊ฐ€ ๋„๋ฐ•์„ ํ–ˆ๋‹ค๋ฉด?๋ผ๋Š” ์ƒ์ƒ์„ ํ•ด๋ดค๋‹ค.
๊ทธ๋ฆฌ๊ณ  ์ด๋Ÿฐ ์ƒ์ƒ์„ ํ•  ์ˆ˜ ์žˆ๊ฒŒ ๋งŒ๋“ค์–ด์ค˜์„œ ์ด ์ฑ…์ด ๋” ์žฌ๋ฐŒ๊ฒŒ ๋‹ค๊ฐ€์™”๋‹ค.
์ผ์ƒ์— ์ง€๋ฃจํ•จ์„ ๋Š๊ปด ๋„๋ฐ•๊ฐ™์€ ์‚ถ์„ ์‚ด๊ณ ์‹ถ๋‹ค๋ฉด ๋„๋ฐ•ํ•˜์ง€๋ง๊ณ  ์ฐจ๋ผ๋ฆฌ ์ด ์ฑ…์„ ๋ณด๊ธธ^^ใ…‹ 

The baseline separates the input text into 13 sentences. You can see that it can't distinguish final eomi (종결어미) from connecting eomi (연결어미); for example, it splits between 이런게 중독이 되나? and 싶었는데, even though 되나? here ends in a connecting eomi (연결어미). There is one more problem: it doesn't recognize embraced sentences (안긴문장). For example, it splits between 못해 빠지지 않았을까? and 라는 생각을 하게 됐다.
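The baseline's behavior can be reproduced with a few lines of regex. The sketch below is a hypothetical reconstruction (the actual baseline code is not shown in this document): it splits right after sentence-final punctuation followed by whitespace, so it necessarily splits after 되나? even when that question mark sits inside a connecting eomi or an embraced sentence.

```python
import re

def baseline_split(text):
    # Naive regex baseline (illustrative assumption, not the actual code):
    # cut after '.', '!' or '?' whenever whitespace follows. Purely lexical,
    # so it cannot tell final eomi from connecting eomi.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

print(baseline_split("이런게 중독이 되나? 싶었는데 이 책이 재밌다 생각들었다."))
# → ['이런게 중독이 되나?', '싶었는데 이 책이 재밌다 생각들었다.']
```

The split after 되나? above is exactly the kind of error a morpheme-aware splitter has to undo.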

Koalanlp (KKMA)

์ฑ… ์†Œ๊ฐœ์— ์ด๊ฑด ์†Œ์„ค์ธ๊ฐ€ ์‹ค์ œ ์ธ๊ฐ€๋ผ๋Š” ๋ฌธ๊ตฌ๋ฅผ ๋ณด๊ณ  ์žฌ๋ฐŒ๊ฒ ๋‹ค ์‹ถ์–ด ๋ณด๊ฒŒ ๋˜์—ˆ๋‹ค.
' ๋ฐ”์นด๋ผ' ๋ผ๋Š” ๋„๋ฐ•์€ 2 ์žฅ์˜ ์นด๋“œ ํ•ฉ์ด ๋†’์€ ์‚ฌ๋žŒ์ด ์ด๊ธฐ๋Š” ๊ฒŒ์ž„์œผ๋กœ ์•„์ฃผ ๋‹จ์ˆœํ•œ ๊ฒŒ์ž„์ด๋‹ค.
์ด๋Ÿฐ ๊ฒŒ ์ค‘๋…์ด ๋˜๋‚˜?
์‹ถ์—ˆ๋Š”๋ฐ ์ด ์ฑ…์ด ๋ฐ”์นด๋ผ์™€ ๋น„์Šทํ•œ ๋งค๋ ฅ์ด ์žˆ๋‹ค ์ƒ๊ฐ ๋“ค์—ˆ๋‹ค.
๋‚ด์šฉ์ด ์Šคํ”ผ๋“œํ•˜๊ฒŒ ์ง„ํ–‰๋˜๊ณ  ๋ง‰ํžˆ๋Š” ๊ตฌ๊ฐ„ ์—†์ด ์ฝํžˆ๋Š” ๊ฒŒ ๋‚˜๋„ ๋ชจ๋ฅด๊ฒŒ ํŽ˜์ด์ง€๋ฅผ ์Šฅ์Šฅ ๋„˜๊ธฐ๊ณ  ์žˆ์—ˆ๋‹ค.
๋ฌผ๋ก  ์ฝ์Œ์œผ๋กœ์จ ํฐ ๋ˆ์„ ๋ฒŒ์ง„ ์•Š์ง€๋งŒ ์ด๋Ÿฐ ์Šคํ”ผ๋“œํ•จ์— ๋‚˜๋„ ๋ชจ๋ฅด๊ฒŒ ๊ณ„์† ๊ฒŒ์ž„์— ์ฐธ์—ฌํ•˜๊ฒŒ ๋˜๊ณ  ๋‚˜์˜ค๋Š” ํƒ€์ด๋ฐ์„ ์žก์ง€ ๋ชปํ•ด ๋น ์ง€์ง€ ์•Š์•˜์„๊นŒ?
๋ผ๋Š” ์ƒ๊ฐ์„ ํ•˜๊ฒŒ ๋๋‹ค.
์ด ์ฑ…์—์„œ ํ˜„์ง€์˜ ๊ฟˆ์€ ๊ฐ€๊ฒฉํ‘œ๋ฅผ ๋ณด์ง€ ์•Š๋Š” ์‚ถ์ด๋ผ ํ•œ๋‹ค.
์ด ๋ถ€๋ถ„์„ ์ฝ๊ณ  ๋‚˜๋ˆ๋ฐ!
๋ผ๋Š” ์ƒ๊ฐํ•˜๋ฉด์„œ ์ˆœ๊ฐ„ ๋„๋ฐ•์ด๋ผ๋Š” ๊ฑธ๋กœ๋ผ๋„ ๋ˆ์„ ๋งŽ์ด ๋ฒŒ์—ˆ๋˜ ํ˜„์ง€๊ฐ€ ๋ถ€๋Ÿฌ์› ๋‹ค.
๊ทธ๋Ÿฌ๋ฉด์„œ ๋‚ด๊ฐ€ ๋„๋ฐ•์„ ํ–ˆ๋‹ค๋ฉด? ๋ผ๋Š” ์ƒ์ƒ์„ ํ•ด๋ดค๋‹ค.
๊ทธ๋ฆฌ๊ณ  ์ด๋Ÿฐ ์ƒ์ƒ์„ ํ•  ์ˆ˜ ์žˆ๊ฒŒ ๋งŒ๋“ค์–ด ์ค˜์„œ ์ด ์ฑ…์ด ๋” ์žฌ๋ฐŒ๊ฒŒ ๋‹ค๊ฐ€์™”๋‹ค.
์ผ์ƒ์— ์ง€๋ฃจํ•จ์„ ๋Š๊ปด ๋„๋ฐ• ๊ฐ™์€ ์‚ถ์„ ์‚ด๊ณ  ์‹ถ๋‹ค๋ฉด ๋„๋ฐ•ํ•˜์ง€ ๋ง๊ณ  ์ฐจ๋ผ๋ฆฌ ์ด ์ฑ…์„ ๋ณด๊ธธ ^^ ใ…‹

The result of Koalanlp was very similar to the baseline; the two problems (final/connecting eomi distinction and embraced sentence recognition) remain.

Kiwi

์ฑ…์†Œ๊ฐœ์— ์ด๊ฑด ์†Œ์„ค์ธ๊ฐ€ ์‹ค์ œ์ธ๊ฐ€
๋ผ๋Š” ๋ฌธ๊ตฌ๋ฅผ ๋ณด๊ณ  ์žฌ๋ฐŒ๊ฒ ๋‹ค ์‹ถ์–ด ๋ณด๊ฒŒ ๋˜์—ˆ๋‹ค.
'๋ฐ”์นด๋ผ'๋ผ๋Š” ๋„๋ฐ•์€ 2์žฅ์˜ ์นด๋“œ ํ•ฉ์ด ๋†’์€ ์‚ฌ๋žŒ์ด ์ด๊ธฐ๋Š” ๊ฒŒ์ž„์œผ๋กœ ์•„์ฃผ ๋‹จ์ˆœํ•œ ๊ฒŒ์ž„์ด๋‹ค.
์ด๋Ÿฐ๊ฒŒ ์ค‘๋…์ด ๋˜๋‚˜?
์‹ถ์—ˆ๋Š”๋ฐ ์ด ์ฑ…์ด ๋ฐ”์นด๋ผ์™€ ๋น„์Šทํ•œ ๋งค๋ ฅ์ด ์žˆ๋‹ค ์ƒ๊ฐ๋“ค์—ˆ๋‹ค.
๋‚ด์šฉ์ด ์Šคํ”ผ๋“œํ•˜๊ฒŒ ์ง„ํ–‰๋˜๊ณ  ๋ง‰ํžˆ๋Š” ๊ตฌ๊ฐ„์—†์ด ์ฝํžˆ๋Š”๊ฒŒ ๋‚˜๋„ ๋ชจ๋ฅด๊ฒŒ ํŽ˜์ด์ง€๋ฅผ ์Šฅ์Šฅ ๋„˜๊ธฐ๊ณ  ์žˆ์—ˆ๋‹ค.
๋ฌผ๋ก  ์ฝ์Œ์œผ๋กœ์จ ํฐ ๋ˆ์„ ๋ฒŒ์ง„ ์•Š์ง€๋งŒ ์ด๋Ÿฐ ์Šคํ”ผ๋“œํ•จ์— ๋‚˜๋„ ๋ชจ๋ฅด๊ฒŒ ๊ณ„์† ๊ฒŒ์ž„์— ์ฐธ์—ฌํ•˜๊ฒŒ ๋˜๊ณ  ๋‚˜์˜ค๋Š” ํƒ€์ด๋ฐ์„ ์žก์ง€ ๋ชปํ•ด ๋น ์ง€์ง€ ์•Š์•˜์„๊นŒ?
๋ผ๋Š” ์ƒ๊ฐ์„ ํ•˜๊ฒŒ ๋๋‹ค.
์ด ์ฑ…์—์„œ ํ˜„์ง€์˜ ๊ฟˆ์€ ๊ฐ€๊ฒฉํ‘œ๋ฅผ ๋ณด์ง€ ์•Š๋Š” ์‚ถ์ด๋ผ ํ•œ๋‹ค.
์ด ๋ถ€๋ถ„์„ ์ฝ๊ณ  ๋‚˜๋ˆ๋ฐ!
๋ผ๋Š” ์ƒ๊ฐํ•˜๋ฉด์„œ ์ˆœ๊ฐ„ ๋„๋ฐ•์ด๋ผ๋Š”๊ฑธ๋กœ๋ผ๋„ ๋ˆ์„ ๋งŽ์ด ๋ฒŒ์—ˆ๋˜ ํ˜„์ง€๊ฐ€ ๋ถ€๋Ÿฌ์› ๋‹ค.
๊ทธ๋Ÿฌ๋ฉด์„œ ๋‚ด๊ฐ€ ๋„๋ฐ•์„ ํ–ˆ๋‹ค๋ฉด?
๋ผ๋Š” ์ƒ์ƒ์„ ํ•ด๋ดค๋‹ค.
๊ทธ๋ฆฌ๊ณ  ์ด๋Ÿฐ ์ƒ์ƒ์„ ํ•  ์ˆ˜ ์žˆ๊ฒŒ ๋งŒ๋“ค์–ด์ค˜์„œ ์ด ์ฑ…์ด ๋” ์žฌ๋ฐŒ๊ฒŒ ๋‹ค๊ฐ€์™”๋‹ค.
์ผ์ƒ์— ์ง€๋ฃจํ•จ์„ ๋Š๊ปด ๋„๋ฐ•๊ฐ™์€ ์‚ถ์„ ์‚ด๊ณ ์‹ถ๋‹ค๋ฉด ๋„๋ฐ•ํ•˜์ง€๋ง๊ณ  ์ฐจ๋ผ๋ฆฌ ์ด ์ฑ…์„ ๋ณด๊ธธ^^ใ…‹

The two problems also appear in the result of Kiwi. In addition, it splits between 실제인가 and 라는, but 이건 소설인가 실제인가 is not an independent sentence; it is an embraced sentence (안긴문장).

Kss (Mecab)

์ฑ…์†Œ๊ฐœ์— ์ด๊ฑด ์†Œ์„ค์ธ๊ฐ€ ์‹ค์ œ์ธ๊ฐ€๋ผ๋Š” ๋ฌธ๊ตฌ๋ฅผ ๋ณด๊ณ  ์žฌ๋ฐŒ๊ฒ ๋‹ค ์‹ถ์–ด ๋ณด๊ฒŒ ๋˜์—ˆ๋‹ค.
'๋ฐ”์นด๋ผ'๋ผ๋Š” ๋„๋ฐ•์€ 2์žฅ์˜ ์นด๋“œ ํ•ฉ์ด ๋†’์€ ์‚ฌ๋žŒ์ด ์ด๊ธฐ๋Š” ๊ฒŒ์ž„์œผ๋กœ ์•„์ฃผ ๋‹จ์ˆœํ•œ ๊ฒŒ์ž„์ด๋‹ค.
์ด๋Ÿฐ๊ฒŒ ์ค‘๋…์ด ๋˜๋‚˜? ์‹ถ์—ˆ๋Š”๋ฐ ์ด ์ฑ…์ด ๋ฐ”์นด๋ผ์™€ ๋น„์Šทํ•œ ๋งค๋ ฅ์ด ์žˆ๋‹ค ์ƒ๊ฐ๋“ค์—ˆ๋‹ค.
๋‚ด์šฉ์ด ์Šคํ”ผ๋“œํ•˜๊ฒŒ ์ง„ํ–‰๋˜๊ณ  ๋ง‰ํžˆ๋Š” ๊ตฌ๊ฐ„์—†์ด ์ฝํžˆ๋Š”๊ฒŒ ๋‚˜๋„ ๋ชจ๋ฅด๊ฒŒ ํŽ˜์ด์ง€๋ฅผ ์Šฅ์Šฅ ๋„˜๊ธฐ๊ณ  ์žˆ์—ˆ๋‹ค.
๋ฌผ๋ก  ์ฝ์Œ์œผ๋กœ์จ ํฐ ๋ˆ์„ ๋ฒŒ์ง„ ์•Š์ง€๋งŒ ์ด๋Ÿฐ ์Šคํ”ผ๋“œํ•จ์— ๋‚˜๋„ ๋ชจ๋ฅด๊ฒŒ ๊ณ„์† ๊ฒŒ์ž„์— ์ฐธ์—ฌํ•˜๊ฒŒ ๋˜๊ณ  ๋‚˜์˜ค๋Š” ํƒ€์ด๋ฐ์„ ์žก์ง€ ๋ชปํ•ด ๋น ์ง€์ง€ ์•Š์•˜์„๊นŒ? ๋ผ๋Š” ์ƒ๊ฐ์„ ํ•˜๊ฒŒ ๋๋‹ค.
์ด ์ฑ…์—์„œ ํ˜„์ง€์˜ ๊ฟˆ์€ ๊ฐ€๊ฒฉํ‘œ๋ฅผ ๋ณด์ง€ ์•Š๋Š” ์‚ถ์ด๋ผ ํ•œ๋‹ค.
์ด ๋ถ€๋ถ„์„ ์ฝ๊ณ  ๋‚˜๋ˆ๋ฐ! ๋ผ๋Š” ์ƒ๊ฐํ•˜๋ฉด์„œ ์ˆœ๊ฐ„ ๋„๋ฐ•์ด๋ผ๋Š”๊ฑธ๋กœ๋ผ๋„ ๋ˆ์„ ๋งŽ์ด ๋ฒŒ์—ˆ๋˜ ํ˜„์ง€๊ฐ€ ๋ถ€๋Ÿฌ์› ๋‹ค.
๊ทธ๋Ÿฌ๋ฉด์„œ ๋‚ด๊ฐ€ ๋„๋ฐ•์„ ํ–ˆ๋‹ค๋ฉด?๋ผ๋Š” ์ƒ์ƒ์„ ํ•ด๋ดค๋‹ค.
๊ทธ๋ฆฌ๊ณ  ์ด๋Ÿฐ ์ƒ์ƒ์„ ํ•  ์ˆ˜ ์žˆ๊ฒŒ ๋งŒ๋“ค์–ด์ค˜์„œ ์ด ์ฑ…์ด ๋” ์žฌ๋ฐŒ๊ฒŒ ๋‹ค๊ฐ€์™”๋‹ค.
์ผ์ƒ์— ์ง€๋ฃจํ•จ์„ ๋Š๊ปด ๋„๋ฐ•๊ฐ™์€ ์‚ถ์„ ์‚ด๊ณ ์‹ถ๋‹ค๋ฉด ๋„๋ฐ•ํ•˜์ง€๋ง๊ณ  ์ฐจ๋ผ๋ฆฌ ์ด ์ฑ…์„ ๋ณด๊ธธ^^ใ…‹

The result of Kss is the same as the gold label, which means Kss handles both problems. Of course, it's not easy to detect such parts while splitting sentences, so Kss runs one more step after segmentation: a postprocessing step that corrects errors in the segmentation results. For example, Korean sentences generally do not begin with a josa (조사). Therefore, if a segmented sentence starts with a josa (조사), Kss recognizes it as an embraced sentence (안긴문장) and attaches it to the previous sentence. For your information, Kss includes many more powerful postprocessing algorithms that correct wrong segmentation results like this.
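The josa-based correction described above can be sketched in a few lines. This is a toy illustration of the idea only, not Kss's internal code: the particle list and the word-level check are my own assumptions.

```python
# Toy sketch of the postprocessing idea: a "sentence" that begins with a
# josa-like particle is treated as an embraced sentence (안긴문장) and
# re-attached to its predecessor. JOSA is an illustrative, incomplete set.
JOSA = {"라는", "이라는", "라며", "라고", "이라고"}

def merge_embraced(sentences):
    merged = []
    for sent in sentences:
        words = sent.split()
        if merged and words and words[0] in JOSA:
            merged[-1] = merged[-1] + " " + sent  # attach to previous sentence
        else:
            merged.append(sent)
    return merged

wrong = ["못해 빠지지 않았을까?", "라는 생각을 하게 됐다."]
print(merge_embraced(wrong))
# → ['못해 빠지지 않았을까? 라는 생각을 하게 됐다.']
```

In practice this check would run over morpheme analysis results (POS tags) rather than a fixed word list, but the merge-backwards logic is the same.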

In conclusion, Kss accounts for more properties of Korean sentences than the other libraries, and these considerations lead to the difference in performance.

6) Speed analysis

I also measured the speed of each tool to compare computational efficiency. The following table shows the computation time of each tool when splitting sample.txt (41 sentences). This is a single blog post, so you can expect similar times when splitting one blog post into sentences. Since computation time may vary with the current CPU status, I measured 5 times and averaged the results. Note that every experiment was conducted in a single-thread/process environment on my M1 MacBook Pro (2021, 13-inch).

Name Library version Backend Average time (msec)
Baseline N/A N/A 0.22
koalanlp 2.1.7 OKT 27.37
koalanlp 2.1.7 HNN 50.39
koalanlp 2.1.7 KMR 757.08
koalanlp 2.1.7 RHINO 978.53
koalanlp 2.1.7 EUNJEON 881.24
koalanlp 2.1.7 ARIRANG 1415.53
koalanlp 2.1.7 KKMA 1971.31
Kiwi 0.14.1 N/A 36.26
Kss (ours) 4.2.0 pecab 7050.50
Kss (ours) 4.2.0 mecab 46.81

The baseline was the fastest (it is just a regex function), followed by Koalanlp (OKT backend), Kiwi, and Kss (mecab backend). The slowest was Kss (pecab backend), about 160 times slower than its mecab backend. Mecab and Kiwi are written in C++, all Koalanlp backends in Java, and Pecab in pure Python, so I think the gap is largely explained by the speed of each implementation language. Therefore, if you can install mecab, using the Kss mecab backend makes the most sense.
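The measurement methodology above (5 runs, averaged, single process) can be reproduced with a small harness like the one below. The harness itself is my sketch; the commented usage line assumes kss is installed and that sample.txt exists.

```python
import statistics
import time

def measure(split_fn, text, runs=5):
    # Average wall-clock time in milliseconds over `runs` executions,
    # mirroring the 5-run averaging used for the table above.
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        split_fn(text)
        times.append((time.perf_counter() - start) * 1000)
    return statistics.mean(times)

# Usage (assumes kss is installed and sample.txt is the benchmark file):
# import kss
# avg_msec = measure(lambda t: kss.split_sentences(t, backend="mecab"),
#                    open("sample.txt").read())
```

Pinning the process to a single core and closing background tasks would further reduce variance, but for rough comparisons averaging is usually enough.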

  • For Linux/macOS users: Kss tries to install python-mecab-kor when you install kss, so you can use the mecab backend very easily. If that installation fails, please install mecab yourself to use the mecab backend.

  • For Windows users: Kss supports mecab-ko-msvc (mecab for Microsoft Visual C++) and its konlpy wrapper. To use the mecab backend, you need to install either mecab or konlpy.tag.Mecab on your machine. There is plenty of information online about installing mecab on Windows.
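The backend='auto' fallback chain documented earlier (mecab → konlpy.tag.Mecab → pecab → punct) can be approximated by probing which packages are importable. This is only a sketch of the resolution order, not Kss's internal code, and the module names are taken from the README.

```python
import importlib.util

def resolve_backend():
    # Probe the documented fallback order without importing anything heavy.
    # find_spec returns None when a top-level package is not installed.
    for module, name in [("mecab", "mecab"),
                         ("konlpy", "konlpy.tag.Mecab"),
                         ("pecab", "pecab")]:
        if importlib.util.find_spec(module) is not None:
            return name
    return "punct"  # last resort: punctuation-only splitting

print(resolve_backend())
```

Running this on your machine tells you which analyzer backend='auto' would likely pick, which is useful when debugging why segmentation is slow (e.g. pecab was selected because mecab failed to install).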


7) Conclusion

I've measured the performance of Kss and other libraries using 7 evaluation datasets, and also measured their speed. I also proposed a new metric named 'Normalized F1 score'. In terms of segmentation performance, Kss performed best on most datasets. In terms of speed, the baseline was the fastest, followed by Koalanlp (OKT backend) and Kiwi, but Kss (mecab backend) also showed a speed that can compete with them.

Although much progress has been made by Kiwi and Kss, there are still many difficulties and limitations in Korean sentence segmentation libraries. In fact, it's also because very few people attack this task. If anyone wants to discuss Korean sentence segmentation algorithms with me or contribute to my work, feel free to send an email to [email protected] or let me know on the Github issue page.


2) split_morphemes: split text into morphemes

from kss import split_morphemes

split_morphemes(
    text: Union[str, List[str], Tuple[str]],
    backend: str = "auto",
    num_workers: Union[int, str] = "auto",
    drop_space: bool = True,
    return_pos: bool = True,
)
Parameters
  • text: String or List/Tuple of strings
    • string: single text segmentation
    • list/tuple of strings: batch texts segmentation
  • backend: Morpheme analyzer backend.
    • backend='auto': find mecab โ†’ konlpy.tag.Mecab โ†’ pecab โ†’ punct and use first found analyzer (default)
    • backend='mecab': find mecab โ†’ konlpy.tag.Mecab and use first found analyzer
    • backend='pecab': use pecab analyzer
    • backend='punct': split sentences only near punctuation marks
  • num_workers: The number of multiprocessing workers
    • num_workers='auto': use multiprocessing with the maximum number of workers if possible (default)
    • num_workers=1: don't use multiprocessing
    • num_workers=2~N: use multiprocessing with the specified number of workers
  • drop_space: Whether it drops all space characters or not
    • drop_space=True: drop all space characters in output (default)
    • drop_space=False: remain all space characters in output
  • return_pos: Return pos information or not
    • return_pos=True: return results with pos information (default)
    • return_pos=False: return results without pos information
Usages
  • Single text segmentation

    import kss
    
    text = "ํšŒ์‚ฌ ๋™๋ฃŒ ๋ถ„๋“ค๊ณผ ๋‹ค๋…€์™”๋Š”๋ฐ ๋ถ„์œ„๊ธฐ๋„ ์ข‹๊ณ  ์Œ์‹๋„ ๋ง›์žˆ์—ˆ์–ด์š” ๋‹ค๋งŒ, ๊ฐ•๋‚จ ํ† ๋ผ์ •์ด ๊ฐ•๋‚จ ์‰‘์‰‘๋ฒ„๊ฑฐ ๊ณจ๋ชฉ๊ธธ๋กœ ์ญ‰ ์˜ฌ๋ผ๊ฐ€์•ผ ํ•˜๋Š”๋ฐ ๋‹ค๋“ค ์‰‘์‰‘๋ฒ„๊ฑฐ์˜ ์œ ํ˜น์— ๋„˜์–ด๊ฐˆ ๋ป” ํ–ˆ๋‹ต๋‹ˆ๋‹ค ๊ฐ•๋‚จ์—ญ ๋ง›์ง‘ ํ† ๋ผ์ •์˜ ์™ธ๋ถ€ ๋ชจ์Šต."
    
    kss.split_morphemes(text)
    # [('ํšŒ์‚ฌ', 'NNG'), ('๋™๋ฃŒ', 'NNG'), ('๋ถ„', 'NNB'), ('๋“ค', 'XSN'), ('๊ณผ', 'JKB'), ('๋‹ค๋…€์™”', 'VV+EP'), ('๋Š”๋ฐ', 'EC'), ('๋ถ„์œ„๊ธฐ', 'NNG'), ('๋„', 'JX'), ('์ข‹', 'VA'), ('๊ณ ', 'EC'), ('์Œ์‹', 'NNG'), ('๋„', 'JX'), ('๋ง›์žˆ', 'VA'), ('์—ˆ', 'EP'), ('์–ด์š”', 'EF'), ('๋‹ค๋งŒ', 'MAJ'), (',', 'SC'), ('๊ฐ•๋‚จ', 'NNP'), ('ํ† ๋ผ', 'NNG'), ('์ •', 'NNG'), ('์ด', 'JKS'), ('๊ฐ•๋‚จ', 'NNP'), ('์‰‘์‰‘', 'MAG'), ('๋ฒ„๊ฑฐ', 'NNG'), ('๊ณจ๋ชฉ๊ธธ', 'NNG'), ('๋กœ', 'JKB'), ('์ญ‰', 'MAG'), ('์˜ฌ๋ผ๊ฐ€', 'VV'), ('์•ผ', 'EC'), ('ํ•˜', 'VV'), ('๋Š”๋ฐ', 'EC'), ('๋‹ค', 'MAG'), ('๋“ค', 'XSN'), ('์‰‘์‰‘', 'MAG'), ('๋ฒ„๊ฑฐ', 'NNG'), ('์˜', 'JKG'), ('์œ ํ˜น', 'NNG'), ('์—', 'JKB'), ('๋„˜์–ด๊ฐˆ', 'VV+ETM'), ('๋ป”', 'NNB'), ('ํ–ˆ', 'VV+EP'), ('๋‹ต๋‹ˆ๋‹ค', 'EC'), ('๊ฐ•๋‚จ์—ญ', 'NNP'), ('๋ง›์ง‘', 'NNG'), ('ํ† ๋ผ', 'NNG'), ('์ •์˜', 'NNG'), ('์™ธ๋ถ€', 'NNG'), ('๋ชจ์Šต', 'NNG'), ('.', 'SF')]
  • Batch texts segmentation

    import kss
    
    texts = [
        "ํšŒ์‚ฌ ๋™๋ฃŒ ๋ถ„๋“ค๊ณผ ๋‹ค๋…€์™”๋Š”๋ฐ ๋ถ„์œ„๊ธฐ๋„ ์ข‹๊ณ  ์Œ์‹๋„ ๋ง›์žˆ์—ˆ์–ด์š” ๋‹ค๋งŒ, ๊ฐ•๋‚จ ํ† ๋ผ์ •์ด ๊ฐ•๋‚จ ์‰‘์‰‘๋ฒ„๊ฑฐ ๊ณจ๋ชฉ๊ธธ๋กœ ์ญ‰ ์˜ฌ๋ผ๊ฐ€์•ผ ํ•˜๋Š”๋ฐ ๋‹ค๋“ค ์‰‘์‰‘๋ฒ„๊ฑฐ์˜ ์œ ํ˜น์— ๋„˜์–ด๊ฐˆ ๋ป” ํ–ˆ๋‹ต๋‹ˆ๋‹ค",
        "๊ฐ•๋‚จ์—ญ ๋ง›์ง‘ ํ† ๋ผ์ •์˜ ์™ธ๋ถ€ ๋ชจ์Šต. ๊ฐ•๋‚จ ํ† ๋ผ์ •์€ 4์ธต ๊ฑด๋ฌผ ๋…์ฑ„๋กœ ์ด๋ฃจ์–ด์ ธ ์žˆ์Šต๋‹ˆ๋‹ค.",
        "์—ญ์‹œ ํ† ๋ผ์ • ๋ณธ ์  ๋‹ต์ฃ ?ใ…Žใ……ใ…Ž ๊ฑด๋ฌผ์€ ํฌ์ง€๋งŒ ๊ฐ„ํŒ์ด ์—†๊ธฐ ๋•Œ๋ฌธ์— ์ง€๋‚˜์น  ์ˆ˜ ์žˆ์œผ๋‹ˆ ์กฐ์‹ฌํ•˜์„ธ์š” ๊ฐ•๋‚จ ํ† ๋ผ์ •์˜ ๋‚ด๋ถ€ ์ธํ…Œ๋ฆฌ์–ด.",
    ]
    
    kss.split_morphemes(texts)
    # [[('ํšŒ์‚ฌ', 'NNG'), ('๋™๋ฃŒ', 'NNG'), ('๋ถ„', 'NNB'), ('๋“ค', 'XSN'), ('๊ณผ', 'JKB'), ('๋‹ค๋…€์™”', 'VV+EP'), ('๋Š”๋ฐ', 'EC'), ('๋ถ„์œ„๊ธฐ', 'NNG'), ('๋„', 'JX'), ('์ข‹', 'VA'), ('๊ณ ', 'EC'), ('์Œ์‹', 'NNG'), ('๋„', 'JX'), ('๋ง›์žˆ', 'VA'), ('์—ˆ', 'EP'), ('์–ด์š”', 'EF'), ('๋‹ค๋งŒ', 'MAJ'), (',', 'SC'), ('๊ฐ•๋‚จ', 'NNP'), ('ํ† ๋ผ', 'NNG'), ('์ •', 'NNG'), ('์ด', 'JKS'), ('๊ฐ•๋‚จ', 'NNP'), ('์‰‘์‰‘', 'MAG'), ('๋ฒ„๊ฑฐ', 'NNG'), ('๊ณจ๋ชฉ๊ธธ', 'NNG'), ('๋กœ', 'JKB'), ('์ญ‰', 'MAG'), ('์˜ฌ๋ผ๊ฐ€', 'VV'), ('์•ผ', 'EC'), ('ํ•˜', 'VV'), ('๋Š”๋ฐ', 'EC'), ('๋‹ค', 'MAG'), ('๋“ค', 'XSN'), ('์‰‘์‰‘', 'MAG'), ('๋ฒ„๊ฑฐ', 'NNG'), ('์˜', 'JKG'), ('์œ ํ˜น', 'NNG'), ('์—', 'JKB'), ('๋„˜์–ด๊ฐˆ', 'VV+ETM'), ('๋ป”', 'NNB'), ('ํ–ˆ', 'VV+EP'), ('๋‹ต๋‹ˆ๋‹ค', 'EC')], 
    # [('๊ฐ•๋‚จ์—ญ', 'NNP'), ('๋ง›์ง‘', 'NNG'), ('ํ† ๋ผ', 'NNG'), ('์ •์˜', 'NNG'), ('์™ธ๋ถ€', 'NNG'), ('๋ชจ์Šต', 'NNG'), ('.', 'SF'), ('๊ฐ•๋‚จ', 'NNP'), ('ํ† ๋ผ', 'NNG'), ('์ •์€', 'NNP'), ('4', 'SN'), ('์ธต', 'NNG'), ('๊ฑด๋ฌผ', 'NNG'), ('๋…์ฑ„', 'NNG'), ('๋กœ', 'JKB'), ('์ด๋ฃจ์–ด์ ธ', 'VV+EC'), ('์žˆ', 'VX'), ('์Šต๋‹ˆ๋‹ค', 'EF'), ('.', 'SF')], 
    # [('์—ญ์‹œ', 'MAJ'), ('ํ† ๋ผ', 'NNG'), ('์ •', 'NNG'), ('๋ณธ', 'VV+ETM'), ('์ ', 'NNB'), ('๋‹ต', 'MAG+VCP'), ('์ฃ ', 'EF'), ('?', 'SF'), ('ใ…Ž', 'IC'), ('ใ……', 'NNG'), ('ใ…Ž', 'IC'), ('๊ฑด๋ฌผ', 'NNG'), ('์€', 'JX'), ('ํฌ', 'VA'), ('์ง€๋งŒ', 'EC'), ('๊ฐ„ํŒ', 'NNG'), ('์ด', 'JKS'), ('์—†', 'VA'), ('๊ธฐ', 'ETN'), ('๋•Œ๋ฌธ', 'NNB'), ('์—', 'JKB'), ('์ง€๋‚˜์น ', 'VV+ETM'), ('์ˆ˜', 'NNB'), ('์žˆ', 'VV'), ('์œผ๋‹ˆ', 'EC'), ('์กฐ์‹ฌ', 'NNG'), ('ํ•˜', 'XSV'), ('์„ธ์š”', 'EP+EF'), ('๊ฐ•๋‚จ', 'NNP'), ('ํ† ๋ผ', 'NNG'), ('์ •์˜', 'NNG'), ('๋‚ด๋ถ€', 'NNG'), ('์ธํ…Œ๋ฆฌ์–ด', 'NNG'), ('.', 'SF')]]
  • Keep space characters for original text recoverability

    import kss
    
    text = "ํšŒ์‚ฌ ๋™๋ฃŒ ๋ถ„๋“ค๊ณผ ๋‹ค๋…€์™”๋Š”๋ฐ ๋ถ„์œ„๊ธฐ๋„ ์ข‹๊ณ  ์Œ์‹๋„ ๋ง›์žˆ์—ˆ์–ด์š”\n๋‹ค๋งŒ,\t๊ฐ•๋‚จ ํ† ๋ผ์ •์ด ๊ฐ•๋‚จ ์‰‘์‰‘๋ฒ„๊ฑฐ ๊ณจ๋ชฉ๊ธธ๋กœ ์ญ‰ ์˜ฌ๋ผ๊ฐ€์•ผ ํ•˜๋Š”๋ฐ ๋‹ค๋“ค ์‰‘์‰‘๋ฒ„๊ฑฐ์˜ ์œ ํ˜น์— ๋„˜์–ด๊ฐˆ ๋ป” ํ–ˆ๋‹ต๋‹ˆ๋‹ค ๊ฐ•๋‚จ์—ญ ๋ง›์ง‘ ํ† ๋ผ์ •์˜ ์™ธ๋ถ€ ๋ชจ์Šต."
    
    kss.split_morphemes(text, drop_space=False)
    # [('ํšŒ์‚ฌ', 'NNG'), (' ', 'SP'), ('๋™๋ฃŒ', 'NNG'), (' ', 'SP'), ('๋ถ„', 'NNB'), ('๋“ค', 'XSN'), ('๊ณผ', 'JKB'), (' ', 'SP'), ('๋‹ค๋…€์™”', 'VV+EP'), ('๋Š”๋ฐ', 'EC'), (' ', 'SP'), ('๋ถ„์œ„๊ธฐ', 'NNG'), ('๋„', 'JX'), (' ', 'SP'), ('์ข‹', 'VA'), ('๊ณ ', 'EC'), (' ', 'SP'), ('์Œ์‹', 'NNG'), ('๋„', 'JX'), (' ', 'SP'), ('๋ง›์žˆ', 'VA'), ('์—ˆ', 'EP'), ('์–ด์š”', 'EF'), ('\n', 'SP'), ('๋‹ค๋งŒ', 'MAJ'), (',', 'SC'), ('\t', 'SP'), ('๊ฐ•๋‚จ', 'NNP'), (' ', 'SP'), ('ํ† ๋ผ', 'NNG'), ('์ •', 'NNG'), ('์ด', 'JKS'), (' ', 'SP'), ('๊ฐ•๋‚จ', 'NNP'), (' ', 'SP'), ('์‰‘์‰‘', 'MAG'), ('๋ฒ„๊ฑฐ', 'NNG'), (' ', 'SP'), ('๊ณจ๋ชฉ๊ธธ', 'NNG'), ('๋กœ', 'JKB'), (' ', 'SP'), ('์ญ‰', 'MAG'), (' ', 'SP'), ('์˜ฌ๋ผ๊ฐ€', 'VV'), ('์•ผ', 'EC'), (' ', 'SP'), ('ํ•˜', 'VV'), ('๋Š”๋ฐ', 'EC'), (' ', 'SP'), ('๋‹ค', 'MAG'), ('๋“ค', 'XSN'), (' ', 'SP'), ('์‰‘์‰‘', 'MAG'), ('๋ฒ„๊ฑฐ', 'NNG'), ('์˜', 'JKG'), (' ', 'SP'), ('์œ ํ˜น', 'NNG'), ('์—', 'JKB'), (' ', 'SP'), ('๋„˜์–ด๊ฐˆ', 'VV+ETM'), (' ', 'SP'), ('๋ป”', 'NNB'), (' ', 'SP'), ('ํ–ˆ', 'VV+EP'), ('๋‹ต๋‹ˆ๋‹ค', 'EC'), (' ', 'SP'), ('๊ฐ•๋‚จ์—ญ', 'NNP'), (' ', 'SP'), ('๋ง›์ง‘', 'NNG'), (' ', 'SP'), ('ํ† ๋ผ', 'NNG'), ('์ •์˜', 'NNG'), (' ', 'SP'), ('์™ธ๋ถ€', 'NNG'), (' ', 'SP'), ('๋ชจ์Šต', 'NNG'), ('.', 'SF')]
  • Get result without POS tagging.

    import kss
    
    text = "ํšŒ์‚ฌ ๋™๋ฃŒ ๋ถ„๋“ค๊ณผ ๋‹ค๋…€์™”๋Š”๋ฐ ๋ถ„์œ„๊ธฐ๋„ ์ข‹๊ณ  ์Œ์‹๋„ ๋ง›์žˆ์—ˆ์–ด์š”\n๋‹ค๋งŒ,\t๊ฐ•๋‚จ ํ† ๋ผ์ •์ด ๊ฐ•๋‚จ ์‰‘์‰‘๋ฒ„๊ฑฐ ๊ณจ๋ชฉ๊ธธ๋กœ ์ญ‰ ์˜ฌ๋ผ๊ฐ€์•ผ ํ•˜๋Š”๋ฐ ๋‹ค๋“ค ์‰‘์‰‘๋ฒ„๊ฑฐ์˜ ์œ ํ˜น์— ๋„˜์–ด๊ฐˆ ๋ป” ํ–ˆ๋‹ต๋‹ˆ๋‹ค ๊ฐ•๋‚จ์—ญ ๋ง›์ง‘ ํ† ๋ผ์ •์˜ ์™ธ๋ถ€ ๋ชจ์Šต."
    
    kss.split_morphemes(text, return_pos=False)
    # ['ํšŒ์‚ฌ', '๋™๋ฃŒ', '๋ถ„', '๋“ค', '๊ณผ', '๋‹ค๋…€์™”', '๋Š”๋ฐ', '๋ถ„์œ„๊ธฐ', '๋„', '์ข‹', '๊ณ ', '์Œ์‹', '๋„', '๋ง›์žˆ', '์—ˆ', '์–ด์š”', '๋‹ค๋งŒ', ',', '๊ฐ•๋‚จ', 'ํ† ๋ผ', '์ •', '์ด', '๊ฐ•๋‚จ', '์‰‘์‰‘', '๋ฒ„๊ฑฐ', '๊ณจ๋ชฉ๊ธธ', '๋กœ', '์ญ‰', '์˜ฌ๋ผ๊ฐ€', '์•ผ', 'ํ•˜', '๋Š”๋ฐ', '๋‹ค', '๋“ค', '์‰‘์‰‘', '๋ฒ„๊ฑฐ', '์˜', '์œ ํ˜น', '์—', '๋„˜์–ด๊ฐˆ', '๋ป”', 'ํ–ˆ', '๋‹ต๋‹ˆ๋‹ค', '๊ฐ•๋‚จ์—ญ', '๋ง›์ง‘', 'ํ† ๋ผ', '์ •์˜', '์™ธ๋ถ€', '๋ชจ์Šต', '.']

3) summarize_sentences: summarize text into important sentences

from kss import summarize_sentences

summarize_sentences(
    text: Union[str, List[str], Tuple[str]],
    backend: str = "auto",
    num_workers: Union[int, str] = "auto",
    max_sentences: int = 3,
    tolerance: float = 0.05,
    strip: bool = True,
    ignores: List[str] = None,
)
Parameters
  • text: String or List/Tuple of strings
    • string: single text segmentation
    • list/tuple of strings: batch texts segmentation
  • backend: Morpheme analyzer backend.
    • backend='auto': find mecab โ†’ konlpy.tag.Mecab โ†’ pecab โ†’ punct and use first found analyzer (default)
    • backend='mecab': find mecab โ†’ konlpy.tag.Mecab and use first found analyzer
    • backend='pecab': use pecab analyzer
    • backend='punct': split sentences only near punctuation marks
  • num_workers: The number of multiprocessing workers
    • num_workers='auto': use multiprocessing with the maximum number of workers if possible (default)
    • num_workers=1: don't use multiprocessing
    • num_workers=2~N: use multiprocessing with the specified number of workers
  • max_sentences: The maximum number of output sentences
    • max_sentences=1~N: return 1~N sentences by sentence importance
  • tolerance: Threshold for omitting edge weights.
  • strip: Whether it does strip() for all output sentences or not
    • strip=True: do strip() for all output sentences (default)
    • strip=False: do not strip() for all output sentences
  • ignores: list of strings that should not be split
    • See detailed usage from the following Usages
Usages
  • Single text summarization

    import kss
    
    text = """๊ฐœ๊ทธ๋งจ ๊ฒธ ๊ฐ€์ˆ˜ โ€˜๊ฐœ๊ฐ€์ˆ˜โ€™ UV ์œ ์„ธ์œค์ด ์‹ ๊ณก ๋ฐœ๋งค ์ดํ›„ ๋งŽ์€ ๋‚จํŽธ๋“ค์˜ ์‘์›์„ ๋ฐ›๊ณ  ์žˆ๋‹ค. ์œ ์„ธ์œค์€ ์ง€๋‚œ 3์ผ ์˜คํ›„ 6์‹œ ์ƒˆ ์‹ฑ๊ธ€ โ€˜๋งˆ๋” ์‚ฌ์ปค(Mother Soccer)(Feat. ์ˆ˜ํผ๋น„)โ€™๋ฅผ ๋ฐœ๋งคํ–ˆ๋‹ค. โ€˜๋งˆ๋” ์‚ฌ์ปคโ€™๋Š” ์•„๋‚ด์— ๋Œ€ํ•œ ์„œ์šดํ•œ ๋งˆ์Œ์„ ์œ„ํŠธ ์žˆ๊ณ  ๊ฐ•ํ•œ ์–ด์กฐ๋กœ ๋””์Šค ํ•˜๋Š” ๋‚จํŽธ ์œ ์„ธ์œค์˜ ๋งˆ์Œ์„ ๋‹ด์€ ๊ณก์ด๋‹ค. ๋ฐœ๋งค ํ›„ ์†Œ์…œ ๋ฏธ๋””์–ด ์ƒ์—์„œ ํ™”์ œ๋ฅผ ๋ชจ์œผ๊ณ  ์žˆ๋Š” ๊ฐ€์šด๋ฐ, ๊ฐ€์ˆ˜ ํ•˜๋™๊ท ์€ โ€œ์œ ์„ธ์œ ๋‹ˆ ๊ดœ์ฐฎ๊ฒ ์–ดโ€๋ผ๋Š” ๋ฐ˜์‘์„ ๋ณด์ด๊ธฐ๋„ ํ–ˆ๋‹ค. ๋ˆ„๋ฆฌ๊พผ๋“ค์€ โ€˜๋‘ ๋ถ„์˜ ์›๋งŒํ•œ ํ•ฉ์˜๊ฐ€ ์žˆ๊ธฐ๋ฅผ ๋ฐ”๋ž๋‹ˆ๋‹คโ€™, โ€˜์ง‘์—๋Š” ๋“ค์–ด๊ฐˆ ์ˆ˜ ์žˆ๊ฒ ๋‚˜โ€™ ๋“ฑ ์œ ์„ธ์œค์˜ ๊ท€๊ฐ€๋ฅผ ๊ฑฑ์ •ํ•˜๋Š” ๋ชจ์Šต์„ ๋ณด์˜€๋‹ค. ์œ ์„ธ์œค์€ ์ ์ž…๊ฐ€๊ฒฝ์œผ๋กœ โ€˜๋งˆ๋” ์‚ฌ์ปคโ€™ ์ฑŒ๋ฆฐ์ง€๋ฅผ ์‹œ์ž‘, ์ž์‹ ์˜ SNS๋ฅผ ํ†ตํ•ด โ€œ๋ถ€๋ถ€ ์‹ธ์›€์ด ์ข€ ์ปค์กŒ๋„ค์š”โ€๋ผ๋ฉฐ ๋ฐฐ์šฐ ์†ก์ง„์šฐ์™€ ํ•จ๊ป˜ ์ดฌ์˜ํ•œ ์˜์ƒ์„ ๊ฒŒ์žฌํ–ˆ๋‹ค. ํ•ด๋‹น ์˜์ƒ์—์„œ๋Š” ์–‘๋ง์„ ์‹ ๊ณ  ์นจ๋Œ€์— ๋“ค์–ด๊ฐ„ ๋’ค ํ™˜ํ˜ธ๋ฅผ ์ง€๋ฅด๊ฑฐ๋‚˜ ํ™”์žฅ์‹ค ๋ถˆ์„ ๋„์ง€ ์•Š๊ณ  ๋„๋ง๊ฐ€๋Š” ๋“ฑ ์•„๋‚ด์˜ ์ž”์†Œ๋ฆฌ ์œ ๋ฐœ ํฌ์ธํŠธ๋ฅผ ์‚ด๋ ค ์žฌ์น˜ ์žˆ๋Š” ์˜์ƒ์„ ์™„์„ฑํ–ˆ๋‹ค. ์œ ์„ธ์œค์€ โ€˜๋งˆ๋” ์‚ฌ์ปคโ€™๋ฅผ ํ†ตํ•ด ๋‚จํŽธ๋“ค์˜ ๋งˆ์Œ์„ ๋Œ€๋ณ€ํ•ด ์ฃผ๊ณ  ์žˆ๋Š” ํ•œํŽธ ์•„๋‚ด์˜ ๋ฐ˜์‘์€ ์–ด๋–จ์ง€ ๊ถ๊ธˆ์ฆ์„ ๋ชจ์€๋‹ค."""
    
    kss.summarize_sentences(text)
    # ['๊ฐœ๊ทธ๋งจ ๊ฒธ ๊ฐ€์ˆ˜ โ€˜๊ฐœ๊ฐ€์ˆ˜โ€™ UV ์œ ์„ธ์œค์ด ์‹ ๊ณก ๋ฐœ๋งค ์ดํ›„ ๋งŽ์€ ๋‚จํŽธ๋“ค์˜ ์‘์›์„ ๋ฐ›๊ณ  ์žˆ๋‹ค.', 'โ€˜๋งˆ๋” ์‚ฌ์ปคโ€™๋Š” ์•„๋‚ด์— ๋Œ€ํ•œ ์„œ์šดํ•œ ๋งˆ์Œ์„ ์œ„ํŠธ ์žˆ๊ณ  ๊ฐ•ํ•œ ์–ด์กฐ๋กœ ๋””์Šค ํ•˜๋Š” ๋‚จํŽธ ์œ ์„ธ์œค์˜ ๋งˆ์Œ์„ ๋‹ด์€ ๊ณก์ด๋‹ค.', '์œ ์„ธ์œค์€ โ€˜๋งˆ๋” ์‚ฌ์ปคโ€™๋ฅผ ํ†ตํ•ด ๋‚จํŽธ๋“ค์˜ ๋งˆ์Œ์„ ๋Œ€๋ณ€ํ•ด ์ฃผ๊ณ  ์žˆ๋Š” ํ•œํŽธ ์•„๋‚ด์˜ ๋ฐ˜์‘์€ ์–ด๋–จ์ง€ ๊ถ๊ธˆ์ฆ์„ ๋ชจ์€๋‹ค.']
  • Batch texts summarization

    import kss
    
    texts = [
        """๊ฐœ๊ทธ๋งจ ๊ฒธ ๊ฐ€์ˆ˜ โ€˜๊ฐœ๊ฐ€์ˆ˜โ€™ UV ์œ ์„ธ์œค์ด ์‹ ๊ณก ๋ฐœ๋งค ์ดํ›„ ๋งŽ์€ ๋‚จํŽธ๋“ค์˜ ์‘์›์„ ๋ฐ›๊ณ  ์žˆ๋‹ค. ์œ ์„ธ์œค์€ ์ง€๋‚œ 3์ผ ์˜คํ›„ 6์‹œ ์ƒˆ ์‹ฑ๊ธ€ โ€˜๋งˆ๋” ์‚ฌ์ปค(Mother Soccer)(Feat. ์ˆ˜ํผ๋น„)โ€™๋ฅผ ๋ฐœ๋งคํ–ˆ๋‹ค. โ€˜๋งˆ๋” ์‚ฌ์ปคโ€™๋Š” ์•„๋‚ด์— ๋Œ€ํ•œ ์„œ์šดํ•œ ๋งˆ์Œ์„ ์œ„ํŠธ ์žˆ๊ณ  ๊ฐ•ํ•œ ์–ด์กฐ๋กœ ๋””์Šค ํ•˜๋Š” ๋‚จํŽธ ์œ ์„ธ์œค์˜ ๋งˆ์Œ์„ ๋‹ด์€ ๊ณก์ด๋‹ค. ๋ฐœ๋งค ํ›„ ์†Œ์…œ ๋ฏธ๋””์–ด ์ƒ์—์„œ ํ™”์ œ๋ฅผ ๋ชจ์œผ๊ณ  ์žˆ๋Š” ๊ฐ€์šด๋ฐ, ๊ฐ€์ˆ˜ ํ•˜๋™๊ท ์€ โ€œ์œ ์„ธ์œ ๋‹ˆ ๊ดœ์ฐฎ๊ฒ ์–ดโ€๋ผ๋Š” ๋ฐ˜์‘์„ ๋ณด์ด๊ธฐ๋„ ํ–ˆ๋‹ค. ๋ˆ„๋ฆฌ๊พผ๋“ค์€ โ€˜๋‘ ๋ถ„์˜ ์›๋งŒํ•œ ํ•ฉ์˜๊ฐ€ ์žˆ๊ธฐ๋ฅผ ๋ฐ”๋ž๋‹ˆ๋‹คโ€™, โ€˜์ง‘์—๋Š” ๋“ค์–ด๊ฐˆ ์ˆ˜ ์žˆ๊ฒ ๋‚˜โ€™ ๋“ฑ ์œ ์„ธ์œค์˜ ๊ท€๊ฐ€๋ฅผ ๊ฑฑ์ •ํ•˜๋Š” ๋ชจ์Šต์„ ๋ณด์˜€๋‹ค. ์œ ์„ธ์œค์€ ์ ์ž…๊ฐ€๊ฒฝ์œผ๋กœ โ€˜๋งˆ๋” ์‚ฌ์ปคโ€™ ์ฑŒ๋ฆฐ์ง€๋ฅผ ์‹œ์ž‘, ์ž์‹ ์˜ SNS๋ฅผ ํ†ตํ•ด โ€œ๋ถ€๋ถ€ ์‹ธ์›€์ด ์ข€ ์ปค์กŒ๋„ค์š”โ€๋ผ๋ฉฐ ๋ฐฐ์šฐ ์†ก์ง„์šฐ์™€ ํ•จ๊ป˜ ์ดฌ์˜ํ•œ ์˜์ƒ์„ ๊ฒŒ์žฌํ–ˆ๋‹ค. ํ•ด๋‹น ์˜์ƒ์—์„œ๋Š” ์–‘๋ง์„ ์‹ ๊ณ  ์นจ๋Œ€์— ๋“ค์–ด๊ฐ„ ๋’ค ํ™˜ํ˜ธ๋ฅผ ์ง€๋ฅด๊ฑฐ๋‚˜ ํ™”์žฅ์‹ค ๋ถˆ์„ ๋„์ง€ ์•Š๊ณ  ๋„๋ง๊ฐ€๋Š” ๋“ฑ ์•„๋‚ด์˜ ์ž”์†Œ๋ฆฌ ์œ ๋ฐœ ํฌ์ธํŠธ๋ฅผ ์‚ด๋ ค ์žฌ์น˜ ์žˆ๋Š” ์˜์ƒ์„ ์™„์„ฑํ–ˆ๋‹ค. ์œ ์„ธ์œค์€ โ€˜๋งˆ๋” ์‚ฌ์ปคโ€™๋ฅผ ํ†ตํ•ด ๋‚จํŽธ๋“ค์˜ ๋งˆ์Œ์„ ๋Œ€๋ณ€ํ•ด ์ฃผ๊ณ  ์žˆ๋Š” ํ•œํŽธ ์•„๋‚ด์˜ ๋ฐ˜์‘์€ ์–ด๋–จ์ง€ ๊ถ๊ธˆ์ฆ์„ ๋ชจ์€๋‹ค.""",
        """์ œ์ž„์Šค ์นด๋ฉ”๋ก  ๊ฐ๋…์˜ ์˜ํ™” โ€˜์•„๋ฐ”ํƒ€: ๋ฌผ์˜ ๊ธธโ€™(์•„๋ฐ”ํƒ€2)์ด ๊ฐœ๋ด‰ 21์ผ ๋งŒ์— ์ „๊ตญ ๋ˆ„์  ๊ด€๊ฐ 800๋งŒ๋ช…์„ ๋‹ฌ์„ฑํ–ˆ๋‹ค. ์˜ฌํ•ด ๊ตญ๋‚ด ์ฒซ โ€˜์ฒœ๋งŒ ์˜ํ™”โ€™๊ฐ€ ๋ ์ง€ ์ฃผ๋ชฉ๋œ๋‹ค. 4์ผ ์˜ํ™”์ง„ํฅ์œ„์›ํšŒ(์˜์ง„์œ„)์— ๋”ฐ๋ฅด๋ฉด ์ง€๋‚œ๋‹ฌ 14์ผ ๊ฐœ๋ด‰ํ•œ โ€˜์•„๋ฐ”ํƒ€2โ€™๋Š” ์ „๋‚  11๋งŒ3902๋ช…์˜ ๊ด€๊ฐ์„ ๋ชจ์•˜๋‹ค. ๋ˆ„์  ๊ด€๊ฐ 800๋งŒ1930๋ช…์œผ๋กœ ์ „ํŽธ โ€˜์•„๋ฐ”ํƒ€โ€™๋ณด๋‹ค 4์ผ ๋น ๋ฅธ ๊ธฐ๋ก์ด๋‹ค. ์ด๋Š” ํ•œ๊ตญ์˜ ์‹ ์ข… ์ฝ”๋กœ๋‚˜๋ฐ”์ด๋Ÿฌ์Šค ๊ฐ์—ผ์ฆ(์ฝ”๋กœ๋‚˜19) ํŒฌ๋ฐ๋ฏน ์ดํ›„ ์„ธ ๋ฒˆ์งธ ๊ธฐ๋ก์ด๋‹ค. ์ง€๋‚œํ•ด ๊ฐœ๋ด‰ํ•œ โ€˜๋ฒ”์ฃ„๋„์‹œ2โ€™์™€ โ€˜ํƒ‘๊ฑด: ๋งค๋ฒ„๋ฆญโ€™์— ์ด์–ด ๊ด€๊ฐ 800๋งŒ๋ช…์„ ๋„˜์–ด์„  ๊ฒƒ์ด๋‹ค. 2021๋…„ 12์›” ๊ฐœ๋ด‰ํ•œ โ€˜์ŠคํŒŒ์ด๋”๋งจ: ๋…ธ ์›จ์ด ํ™ˆโ€™๋„ 800๋งŒ๋ช…์„ ๋„˜์ง€ ๋ชปํ–ˆ๋‹ค. ์ด๋ฒˆ์— โ€˜์•„๋ฐ”ํƒ€2โ€™๊ฐ€ 1000๋งŒ๋ช…์„ ๋ŒํŒŒํ•œ๋‹ค๋ฉด 2019๋…„ ๊ฐœ๋ด‰ํ•œ โ€˜์–ด๋ฒค์ ธ์Šค: ์—”๋“œ๊ฒŒ์ž„โ€™ ์ดํ›„ 5๋…„ ๋งŒ์— ์ฒซ 1000๋งŒ ๊ตญ๋‚ด ๊ฐœ๋ด‰ ์™ธ๊ตญ ์˜ํ™”๊ฐ€ ๋œ๋‹ค. ์˜์ง„์œ„์˜ ํ†ตํ•ฉ์ „์‚ฐ๋ง์— ๋”ฐ๋ฅด๋ฉด โ€˜์•„๋ฐ”ํƒ€2โ€™์˜ ๊ตญ๋‚ด ์‹ค์‹œ๊ฐ„ ์˜ˆ๋งค์œจ์€ 53.9%(4์ผ ์˜ค์ „ 10์‹œ ๊ธฐ์ค€)๋กœ ์ด๋‚  ๊ฐœ๋ด‰ํ•œ ์ผ๋ณธ ์˜ํ™” โ€˜๋” ํผ์ŠคํŠธ ์Šฌ๋žจ๋ฉํฌโ€™(12.9%)๋ณด๋‹ค ์•ฝ 4๋ฐฐ ์ด์ƒ ๋†’์€ ์˜ˆ๋งค์œจ์„ ๊ธฐ๋กํ–ˆ๋‹ค. ๋ฐ•์Šค์˜คํ”ผ์Šค 2์œ„๋Š” ์ •์„ฑํ™” ์ฃผ์—ฐ์˜ ํ•œ๊ตญ ๋ฎค์ง€์ปฌ ์˜ํ™” โ€˜์˜์›…โ€™์œผ๋กœ, ๋ˆ„์  ๊ด€๊ฐ 180๋งŒ๋ช…์„ ๊ธฐ๋กํ–ˆ๋‹ค. ์ด์–ด ์ž‘๋…„ 11์›” ๋ง ๊ฐœ๋ด‰ํ•œ ์ผ๋ณธ ์˜ํ™” โ€˜์˜ค๋Š˜ ๋ฐค, ์„ธ๊ณ„์—์„œ ์ด ์‚ฌ๋ž‘์ด ์‚ฌ๋ผ์ง„๋‹ค ํ•ด๋„โ€™๊ฐ€ ๋ˆ„์  ๊ด€๊ฐ 72๋งŒ๋ช…์œผ๋กœ 3์œ„๋ฅผ ์œ ์ง€ํ•˜๊ณ  ์žˆ๋‹ค.""",
    ]
    
    kss.summarize_sentences(texts)
    # [['๊ฐœ๊ทธ๋งจ ๊ฒธ ๊ฐ€์ˆ˜ โ€˜๊ฐœ๊ฐ€์ˆ˜โ€™ UV ์œ ์„ธ์œค์ด ์‹ ๊ณก ๋ฐœ๋งค ์ดํ›„ ๋งŽ์€ ๋‚จํŽธ๋“ค์˜ ์‘์›์„ ๋ฐ›๊ณ  ์žˆ๋‹ค.', 'โ€˜๋งˆ๋” ์‚ฌ์ปคโ€™๋Š” ์•„๋‚ด์— ๋Œ€ํ•œ ์„œ์šดํ•œ ๋งˆ์Œ์„ ์œ„ํŠธ ์žˆ๊ณ  ๊ฐ•ํ•œ ์–ด์กฐ๋กœ ๋””์Šค ํ•˜๋Š” ๋‚จํŽธ ์œ ์„ธ์œค์˜ ๋งˆ์Œ์„ ๋‹ด์€ ๊ณก์ด๋‹ค.', '์œ ์„ธ์œค์€ โ€˜๋งˆ๋” ์‚ฌ์ปคโ€™๋ฅผ ํ†ตํ•ด ๋‚จํŽธ๋“ค์˜ ๋งˆ์Œ์„ ๋Œ€๋ณ€ํ•ด ์ฃผ๊ณ  ์žˆ๋Š” ํ•œํŽธ ์•„๋‚ด์˜ ๋ฐ˜์‘์€ ์–ด๋–จ์ง€ ๊ถ๊ธˆ์ฆ์„ ๋ชจ์€๋‹ค.'], 
    # ['์ œ์ž„์Šค ์นด๋ฉ”๋ก  ๊ฐ๋…์˜ ์˜ํ™” โ€˜์•„๋ฐ”ํƒ€: ๋ฌผ์˜ ๊ธธโ€™(์•„๋ฐ”ํƒ€2)์ด ๊ฐœ๋ด‰ 21์ผ ๋งŒ์— ์ „๊ตญ ๋ˆ„์  ๊ด€๊ฐ 800๋งŒ๋ช…์„ ๋‹ฌ์„ฑํ–ˆ๋‹ค.', '4์ผ ์˜ํ™”์ง„ํฅ์œ„์›ํšŒ(์˜์ง„์œ„)์— ๋”ฐ๋ฅด๋ฉด ์ง€๋‚œ๋‹ฌ 14์ผ ๊ฐœ๋ด‰ํ•œ โ€˜์•„๋ฐ”ํƒ€2โ€™๋Š” ์ „๋‚  11๋งŒ3902๋ช…์˜ ๊ด€๊ฐ์„ ๋ชจ์•˜๋‹ค.', '๋ฐ•์Šค์˜คํ”ผ์Šค 2์œ„๋Š” ์ •์„ฑํ™” ์ฃผ์—ฐ์˜ ํ•œ๊ตญ ๋ฎค์ง€์ปฌ ์˜ํ™” โ€˜์˜์›…โ€™์œผ๋กœ, ๋ˆ„์  ๊ด€๊ฐ 180๋งŒ๋ช…์„ ๊ธฐ๋กํ–ˆ๋‹ค.']]
  • Set max_sentences if you want more or fewer than three sentences from the text

    import kss
    
    text = """๊ฐœ๊ทธ๋งจ ๊ฒธ ๊ฐ€์ˆ˜ โ€˜๊ฐœ๊ฐ€์ˆ˜โ€™ UV ์œ ์„ธ์œค์ด ์‹ ๊ณก ๋ฐœ๋งค ์ดํ›„ ๋งŽ์€ ๋‚จํŽธ๋“ค์˜ ์‘์›์„ ๋ฐ›๊ณ  ์žˆ๋‹ค. ์œ ์„ธ์œค์€ ์ง€๋‚œ 3์ผ ์˜คํ›„ 6์‹œ ์ƒˆ ์‹ฑ๊ธ€ โ€˜๋งˆ๋” ์‚ฌ์ปค(Mother Soccer)(Feat. ์ˆ˜ํผ๋น„)โ€™๋ฅผ ๋ฐœ๋งคํ–ˆ๋‹ค. โ€˜๋งˆ๋” ์‚ฌ์ปคโ€™๋Š” ์•„๋‚ด์— ๋Œ€ํ•œ ์„œ์šดํ•œ ๋งˆ์Œ์„ ์œ„ํŠธ ์žˆ๊ณ  ๊ฐ•ํ•œ ์–ด์กฐ๋กœ ๋””์Šค ํ•˜๋Š” ๋‚จํŽธ ์œ ์„ธ์œค์˜ ๋งˆ์Œ์„ ๋‹ด์€ ๊ณก์ด๋‹ค. ๋ฐœ๋งค ํ›„ ์†Œ์…œ ๋ฏธ๋””์–ด ์ƒ์—์„œ ํ™”์ œ๋ฅผ ๋ชจ์œผ๊ณ  ์žˆ๋Š” ๊ฐ€์šด๋ฐ, ๊ฐ€์ˆ˜ ํ•˜๋™๊ท ์€ โ€œ์œ ์„ธ์œ ๋‹ˆ ๊ดœ์ฐฎ๊ฒ ์–ดโ€๋ผ๋Š” ๋ฐ˜์‘์„ ๋ณด์ด๊ธฐ๋„ ํ–ˆ๋‹ค. ๋ˆ„๋ฆฌ๊พผ๋“ค์€ โ€˜๋‘ ๋ถ„์˜ ์›๋งŒํ•œ ํ•ฉ์˜๊ฐ€ ์žˆ๊ธฐ๋ฅผ ๋ฐ”๋ž๋‹ˆ๋‹คโ€™, โ€˜์ง‘์—๋Š” ๋“ค์–ด๊ฐˆ ์ˆ˜ ์žˆ๊ฒ ๋‚˜โ€™ ๋“ฑ ์œ ์„ธ์œค์˜ ๊ท€๊ฐ€๋ฅผ ๊ฑฑ์ •ํ•˜๋Š” ๋ชจ์Šต์„ ๋ณด์˜€๋‹ค. ์œ ์„ธ์œค์€ ์ ์ž…๊ฐ€๊ฒฝ์œผ๋กœ โ€˜๋งˆ๋” ์‚ฌ์ปคโ€™ ์ฑŒ๋ฆฐ์ง€๋ฅผ ์‹œ์ž‘, ์ž์‹ ์˜ SNS๋ฅผ ํ†ตํ•ด โ€œ๋ถ€๋ถ€ ์‹ธ์›€์ด ์ข€ ์ปค์กŒ๋„ค์š”โ€๋ผ๋ฉฐ ๋ฐฐ์šฐ ์†ก์ง„์šฐ์™€ ํ•จ๊ป˜ ์ดฌ์˜ํ•œ ์˜์ƒ์„ ๊ฒŒ์žฌํ–ˆ๋‹ค. ํ•ด๋‹น ์˜์ƒ์—์„œ๋Š” ์–‘๋ง์„ ์‹ ๊ณ  ์นจ๋Œ€์— ๋“ค์–ด๊ฐ„ ๋’ค ํ™˜ํ˜ธ๋ฅผ ์ง€๋ฅด๊ฑฐ๋‚˜ ํ™”์žฅ์‹ค ๋ถˆ์„ ๋„์ง€ ์•Š๊ณ  ๋„๋ง๊ฐ€๋Š” ๋“ฑ ์•„๋‚ด์˜ ์ž”์†Œ๋ฆฌ ์œ ๋ฐœ ํฌ์ธํŠธ๋ฅผ ์‚ด๋ ค ์žฌ์น˜ ์žˆ๋Š” ์˜์ƒ์„ ์™„์„ฑํ–ˆ๋‹ค. ์œ ์„ธ์œค์€ โ€˜๋งˆ๋” ์‚ฌ์ปคโ€™๋ฅผ ํ†ตํ•ด ๋‚จํŽธ๋“ค์˜ ๋งˆ์Œ์„ ๋Œ€๋ณ€ํ•ด ์ฃผ๊ณ  ์žˆ๋Š” ํ•œํŽธ ์•„๋‚ด์˜ ๋ฐ˜์‘์€ ์–ด๋–จ์ง€ ๊ถ๊ธˆ์ฆ์„ ๋ชจ์€๋‹ค."""
    
    kss.summarize_sentences(text, max_sentences=4)
    # ['๊ฐœ๊ทธ๋งจ ๊ฒธ ๊ฐ€์ˆ˜ โ€˜๊ฐœ๊ฐ€์ˆ˜โ€™ UV ์œ ์„ธ์œค์ด ์‹ ๊ณก ๋ฐœ๋งค ์ดํ›„ ๋งŽ์€ ๋‚จํŽธ๋“ค์˜ ์‘์›์„ ๋ฐ›๊ณ  ์žˆ๋‹ค.', 'โ€˜๋งˆ๋” ์‚ฌ์ปคโ€™๋Š” ์•„๋‚ด์— ๋Œ€ํ•œ ์„œ์šดํ•œ ๋งˆ์Œ์„ ์œ„ํŠธ ์žˆ๊ณ  ๊ฐ•ํ•œ ์–ด์กฐ๋กœ ๋””์Šค ํ•˜๋Š” ๋‚จํŽธ ์œ ์„ธ์œค์˜ ๋งˆ์Œ์„ ๋‹ด์€ ๊ณก์ด๋‹ค.', '์œ ์„ธ์œค์€ ์ ์ž…๊ฐ€๊ฒฝ์œผ๋กœ โ€˜๋งˆ๋” ์‚ฌ์ปคโ€™ ์ฑŒ๋ฆฐ์ง€๋ฅผ ์‹œ์ž‘, ์ž์‹ ์˜ SNS๋ฅผ ํ†ตํ•ด โ€œ๋ถ€๋ถ€ ์‹ธ์›€์ด ์ข€ ์ปค์กŒ๋„ค์š”โ€๋ผ๋ฉฐ ๋ฐฐ์šฐ ์†ก์ง„์šฐ์™€ ํ•จ๊ป˜ ์ดฌ์˜ํ•œ ์˜์ƒ์„ ๊ฒŒ์žฌํ–ˆ๋‹ค.', '์œ ์„ธ์œค์€ โ€˜๋งˆ๋” ์‚ฌ์ปคโ€™๋ฅผ ํ†ตํ•ด ๋‚จํŽธ๋“ค์˜ ๋งˆ์Œ์„ ๋Œ€๋ณ€ํ•ด ์ฃผ๊ณ  ์žˆ๋Š” ํ•œํŽธ ์•„๋‚ด์˜ ๋ฐ˜์‘์€ ์–ด๋–จ์ง€ ๊ถ๊ธˆ์ฆ์„ ๋ชจ์€๋‹ค.']
Why text summarization in Kss?

There's textrankr, a text summarization module for Korean, so someone might ask, "Why add a summarization feature to Kss?". The reason is that sentence segmentation performance is very important in the text summarization domain.

Before summarizing text, we must split it into sentences. However, textrankr splits sentences using a very naive regex-based method, which degrades summarization performance. In addition, the user must pass a tokenizer into the TextRank class, which is a bit bothersome. So I fixed these two problems of textrankr and added the codebase into Kss.

Kss has one of the best sentence segmentation modules among Korean language processing libraries, and this alone can improve text summarization performance without modifying any of the summarization-related algorithms in textrankr.
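To see why segmentation quality dominates here, consider a stripped-down TextRank-style scorer: sentences are nodes, pairwise word overlap gives edge weights, and edges below a `tolerance` threshold are dropped (matching the role of the tolerance parameter documented above). If segmentation glues everything into one "sentence", there is nothing to rank. The similarity measure and degree-based scoring below are deliberate simplifications, not the actual textrankr/Kss algorithm.

```python
def summarize(sentences, max_sentences=3, tolerance=0.05):
    # Jaccard word overlap as a toy similarity; real TextRank uses
    # morpheme-level similarity and PageRank-style iteration.
    def sim(a, b):
        wa, wb = set(a.split()), set(b.split())
        return len(wa & wb) / max(len(wa | wb), 1)

    # Score each sentence by the total weight of its kept edges
    # (edges with weight <= tolerance are omitted from the graph).
    scores = []
    for i, s in enumerate(sentences):
        weight = sum(w for j, t in enumerate(sentences) if j != i
                     for w in [sim(s, t)] if w > tolerance)
        scores.append((weight, i))
    top = sorted(sorted(scores, reverse=True)[:max_sentences], key=lambda x: x[1])
    return [sentences[i] for _, i in top]
```

With only one giant pseudo-sentence as input (the textrankr failure mode below), this function can do nothing but return it unchanged; with properly segmented input it selects the most central sentences.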

Let's see the following example.

text = """์–ด๋Šํ™”์ฐฝํ•œ๋‚  ์ถœ๊ทผ์ „์— ๋„ˆ๋ฌด์ผ์ฐ์ผ์–ด๋‚˜ ๋ฒ„๋ ธ์Œ (์ถœ๊ทผ์‹œ๊ฐ„ 19์‹œ)
ํ• ๊บผ๋„์—†๊ณ ํ•ด์„œ ์นดํŽ˜๋ฅผ ์ฐพ์•„ ์‹œ๋‚ด๋กœ ๋‚˜๊ฐ”์Œ
์ƒˆ๋กœ์ƒ๊ธด๊ณณ์— ์‚ฌ์žฅ๋‹˜์ด ์ปคํ”ผ์„ ์ˆ˜์ธ์ง€ ์ปคํ”ผ๋ฐ•์‚ฌ๋ผ๊ณ  ํ•ด์„œ ๊ฐ”์Œ
์˜คํ”ˆํ•œ์ง€ ์–ผ๋งˆ์•ˆ๋˜์„œ ๊ทธ๋Ÿฐ์ง€ ์†๋‹˜์ด ์–ผ๋งˆ์—†์—ˆ์Œ
์กฐ์šฉํ•˜๊ณ  ์ข‹๋‹ค๋ฉฐ ์ข‹์•„ํ•˜๋Š”๊ฑธ์‹œ์ผœ์„œ ํ…Œ๋ผ์Šค์— ์•‰์Œ"""

Output of textrankr is:

import textrankr
import mecab

tokenizer = mecab.MeCab().morphs
textrankr_class = textrankr.TextRank(tokenizer=tokenizer)
textrankr_output = textrankr_class.summarize(text, verbose=False)
print(textrankr_output)
output:

['์–ด๋Šํ™”์ฐฝํ•œ๋‚  ์ถœ๊ทผ์ „์— ๋„ˆ๋ฌด์ผ์ฐ์ผ์–ด๋‚˜ ๋ฒ„๋ ธ์Œ (์ถœ๊ทผ์‹œ๊ฐ„ 19์‹œ) ํ• ๊บผ๋„์—†๊ณ ํ•ด์„œ ์นดํŽ˜๋ฅผ ์ฐพ์•„ ์‹œ๋‚ด๋กœ ๋‚˜๊ฐ”์Œ ์ƒˆ๋กœ์ƒ๊ธด๊ณณ์— ์‚ฌ์žฅ๋‹˜์ด ์ปคํ”ผ์„ ์ˆ˜์ธ์ง€ ์ปคํ”ผ๋ฐ•์‚ฌ๋ผ๊ณ  ํ•ด์„œ ๊ฐ”์Œ ์˜คํ”ˆํ•œ์ง€ ์–ผ๋งˆ์•ˆ๋˜์„œ ๊ทธ๋Ÿฐ์ง€ ์†๋‹˜์ด ์–ผ๋งˆ์—†์—ˆ์Œ ์กฐ์šฉํ•˜๊ณ  ์ข‹๋‹ค๋ฉฐ ์ข‹์•„ํ•˜๋Š”๊ฑธ์‹œ์ผœ์„œ ํ…Œ๋ผ์Šค์— ์•‰์Œ ๊ทผ๋ฐ ์กฐ์šฉํ•˜๋˜ ์นดํŽ˜๊ฐ€ ์‚ฐ๋งŒํ•ด์ง ์†Œ๋ฆฌ์˜ ์ถœ์ฒ˜๋Š” ์นด์šดํ„ฐ์˜€์Œ(ํ…Œ๋ผ์Šค๊ฐ€ ์นด์šดํ„ฐ ๋ฐ”๋กœ์˜†)']

Output of kss is:

import kss

kss.summarize_sentences(text)
output:

['ํ• ๊บผ๋„์—†๊ณ ํ•ด์„œ ์นดํŽ˜๋ฅผ ์ฐพ์•„ ์‹œ๋‚ด๋กœ ๋‚˜๊ฐ”์Œ', '์ƒˆ๋กœ์ƒ๊ธด๊ณณ์— ์‚ฌ์žฅ๋‹˜์ด ์ปคํ”ผ์„ ์ˆ˜์ธ์ง€ ์ปคํ”ผ๋ฐ•์‚ฌ๋ผ๊ณ  ํ•ด์„œ ๊ฐ”์Œ', '์กฐ์šฉํ•˜๊ณ  ์ข‹๋‹ค๋ฉฐ ์ข‹์•„ํ•˜๋Š”๊ฑธ์‹œ์ผœ์„œ ํ…Œ๋ผ์Šค์— ์•‰์Œ']

You can see that textrankr failed to summarize the text because it couldn't split the input into sentences, while Kss summarized it well. Kss's usage is also much simpler than textrankr's. That's why I added this feature to Kss.


Kss in various programming languages

Kss is available in various programming languages.

Citation

If you find this toolkit useful, please consider citing:

@misc{kss,
  author       = {Ko, Hyunwoong and Park, Sang-kil},
  title        = {Kss: A Toolkit for Korean sentence segmentation},
  howpublished = {\url{https://github.com/hyunwoongko/kss}},
  year         = {2021},
}

License

Kss project is licensed under the terms of the BSD 3-Clause "New" or "Revised" License.

Copyright 2021 Hyunwoong Ko and Sang-kil Park. All Rights Reserved.
