  • Stars: 594
  • Rank: 72,323 (Top 2 %)
  • Language: Scala
  • License: Apache License 2.0
  • Created over 7 years ago
  • Updated about 1 month ago

Repository Details

Open Korean Text Processor - An Open-source Korean Text Processor

Open-source Korean Text Processor (Official Fork of twitter-korean-text)

Scala/Java library to process Korean text with a Java wrapper. open-korean-text currently provides Korean normalization and tokenization. Please join our community at Google Forum. The intent of this text processor is not limited to short tweet texts.

This is a Korean text processor written in Scala. It currently supports text normalization, morphological analysis, and stemming, and it handles long texts as well as short tweets. If you would like to take part in development, please join the Google Forum. Everyone is welcome, from beginners who want to learn how to use the library to those who want to contribute code.

Detailed guide to installing and modifying the library

The goal of open-korean-text is to extract index terms from big data and similar sources through lightweight Korean text processing. It does not aim for complete morphological analysis.

open-korean-text supports four features: normalization, tokenization, stemming, and phrase extraction.

Normalization (입니닼ㅋㅋ -> 입니다 ㅋㅋ, 샤릉해 -> 사랑해)

  • 한국어를 처리하는 예시입니닼ㅋㅋㅋㅋㅋ -> 한국어를 처리하는 예시입니다 ㅋㅋ
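
To make the idea of normalization concrete, here is a minimal Java sketch of just one aspect visible in the example above: shortening a long run of a repeated character (ㅋㅋㅋㅋㅋ -> ㅋㅋ). This is an illustration only and not how open-korean-text implements normalization. The class and method names are hypothetical, and the sketch does not split a trailing consonant off a syllable (닼 -> 다 + ㅋ) or restore colloquial spellings (샤릉해 -> 사랑해).

```java
// Toy illustration: collapse runs of 3 or more identical characters to 2.
// Assumption: this is NOT the library's algorithm, only a simplified sketch.
final class RepeatCollapseSketch {

    // Hypothetical helper: "(.)\\1{2,}" matches a character followed by
    // at least two more copies of itself; the replacement keeps two copies.
    static String collapseRepeats(String text) {
        return text.replaceAll("(.)\\1{2,}", "$1$1");
    }

    public static void main(String[] args) {
        // Five repeated characters in, two out.
        System.out.println(collapseRepeats("ㅋㅋㅋㅋㅋ"));
    }
}
```

Runs of one or two characters are left untouched, so ordinary doubled letters survive.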

Tokenization

  • 한국어를 처리하는 예시입니다 ㅋㅋ -> 한국어/Noun, 를/Josa, 처리/Noun, 하는/Verb, 예시/Noun, 입니다/Adjective(이다), ㅋㅋ/KoreanParticle

Stemming (입니다 -> 이다)

  • 한국어를 처리하는 예시입니다 ㅋㅋ -> 한국어/Noun, 를/Josa, 처리/Noun, 하다/Verb, 예시/Noun, 이다/Adjective, ㅋㅋ/KoreanParticle

Phrase extraction

  • 한국어를 처리하는 예시입니다 ㅋㅋ -> 한국어, 처리, 예시, 처리하는 예시

Introductory Presentation: Google Slides

Web API Service

open-korean-text-api
This API service is hosted on Heroku (Domain: https://open-korean-text.herokuapp.com/) and currently provides normalization, tokenization, stemming, and phrase extraction.

The services and their usage are as follows.
The action for each service is normalize, tokenize, stem, or extractPhrases, and the query parameter is text.

Service Usage
Normalization https://open-korean-text-api.herokuapp.com/normalize?text=오픈코리안텍스트
Tokenization https://open-korean-text-api.herokuapp.com/tokenize?text=오픈코리안텍스트
Stemming https://open-korean-text-api.herokuapp.com/stem?text=오픈코리안텍스트
Phrase extraction https://open-korean-text-api.herokuapp.com/extractPhrases?text=오픈코리안텍스트
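
Since the text parameter usually contains Korean, it has to be percent-encoded before it goes into a URL. The Java sketch below builds request URLs for the four actions listed above. The class and method names are hypothetical, and the base URL is copied from the usage list without checking whether the service is still reachable. No request is sent.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Sketch: builds Web API request URLs; it does not call the service.
final class OktApiUrlSketch {

    // Base URL as it appears in the usage list (availability not verified).
    static final String BASE = "https://open-korean-text-api.herokuapp.com";

    // The four actions named in the text.
    static final List<String> ACTIONS =
            List.of("normalize", "tokenize", "stem", "extractPhrases");

    // Hypothetical helper: BASE/<action>?text=<UTF-8 percent-encoded text>.
    static String build(String action, String text) {
        if (!ACTIONS.contains(action)) {
            throw new IllegalArgumentException("Unknown action: " + action);
        }
        return BASE + "/" + action + "?text="
                + URLEncoder.encode(text, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Print one URL per action for the sample text used in the list.
        for (String action : ACTIONS) {
            System.out.println(build(action, "오픈코리안텍스트"));
        }
    }
}
```

URLEncoder applies form encoding, so a space in the text becomes "+". Whether the service treats "+" and "%20" alike is not stated in this document.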

Semantic Versioning

1.0.2 (Major.Minor.Patch)

  • Major: API change
  • Minor: Processor behavior change
  • Patch: Bug fixes without a behavior change
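
As a small illustration of the scheme above, the following Java sketch compares two version strings and reports which kind of change the scheme implies. The class and method names are hypothetical and not part of the library. The two version strings are the ones that appear in this document.

```java
// Sketch: classifies the difference between two Major.Minor.Patch versions.
final class OktVersionSketch {
    final int major;
    final int minor;
    final int patch;

    OktVersionSketch(int major, int minor, int patch) {
        this.major = major;
        this.minor = minor;
        this.patch = patch;
    }

    // Parses "Major.Minor.Patch"; rejects anything without exactly 3 parts.
    static OktVersionSketch parse(String s) {
        String[] parts = s.trim().split("\\.");
        if (parts.length != 3) {
            throw new IllegalArgumentException("Expected Major.Minor.Patch: " + s);
        }
        return new OktVersionSketch(Integer.parseInt(parts[0]),
                Integer.parseInt(parts[1]), Integer.parseInt(parts[2]));
    }

    // Reports the most significant component that differs.
    static String changeKind(String from, String to) {
        OktVersionSketch a = parse(from);
        OktVersionSketch b = parse(to);
        if (a.major != b.major) return "API change";
        if (a.minor != b.minor) return "processor behavior change";
        if (a.patch != b.patch) return "bug fixes without a behavior change";
        return "no change";
    }

    public static void main(String[] args) {
        System.out.println(changeKind("1.0.2", "2.1.0")); // prints "API change"
    }
}
```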

API

Maven

To include this in your Maven-based JVM project, add the following lines to your pom.xml:

  <dependency>
    <groupId>org.openkoreantext</groupId>
    <artifactId>open-korean-text</artifactId>
    <version>2.1.0</version>
  </dependency>

Maven Repository: http://mvnrepository.com/artifact/org.openkoreantext/open-korean-text

Support for other languages.

Type Language Contributor
Wrapper .net/C# modamoda
Wrapper Node JS Ch0p
Wrapper Node JS Youngrok Kim
Wrapper Python Jaepil Jeong
Wrapper Clojure Seonho Kim
Wrapper Ruby for Java Version jun85664396
Wrapper Ruby for Scala Version Jaehyun Shin
Porting Python Baeg-il Kim
Package Python Korean NLP KoNLPy
Package Elastic Search socurites
Package Elastic Search Jaehyun Shin

Get the source

Clone the Git repo and build with Maven.

git clone https://github.com/open-korean-text/open-korean-text.git
cd open-korean-text
mvn compile

Open 'pom.xml' from your favorite IDE.

Basic Usage

You can find usage examples in the examples folder.

Running Tests

Run mvn test to execute all unit tests.

Contribution

Refer to the general contribution guide. We will add this project-specific contribution guide later.

Detailed guide to installing and modifying the library

Performance

Tested on Intel i7 2.3 GHz

Initial loading time: 2~4 sec

Average time to parse a chunk (eojeol): 0.12 ms

Tweets (Avg length ~50 chars)

Tweets 100K 200K 300K 400K 500K 600K 700K 800K 900K 1M
Time in Seconds 57.59 112.09 165.05 218.11 270.54 328.52 381.09 439.71 492.94 542.12

Average per tweet: 0.54212 ms
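
The per-tweet average is the total time divided by the number of tweets (542.12 s / 1,000,000 = 0.54212 ms). The Java sketch below recomputes that figure and the implied throughput for every column of the table. The class and method names are hypothetical, and the numbers are copied from the table rather than measured again.

```java
// Sketch: derives ms per tweet and tweets per second from the table figures.
final class OktThroughputSketch {

    // Tweet counts and elapsed seconds copied from the benchmark table.
    static final int[] TWEETS = {100_000, 200_000, 300_000, 400_000, 500_000,
            600_000, 700_000, 800_000, 900_000, 1_000_000};
    static final double[] SECONDS = {57.59, 112.09, 165.05, 218.11, 270.54,
            328.52, 381.09, 439.71, 492.94, 542.12};

    // Average processing time per tweet in milliseconds.
    static double msPerTweet(int tweets, double seconds) {
        return seconds * 1000.0 / tweets;
    }

    // Average number of tweets processed per second.
    static double tweetsPerSecond(int tweets, double seconds) {
        return tweets / seconds;
    }

    public static void main(String[] args) {
        for (int i = 0; i < TWEETS.length; i++) {
            System.out.println(String.format("%,d tweets: %.5f ms/tweet, %.0f tweets/s",
                    TWEETS[i],
                    msPerTweet(TWEETS[i], SECONDS[i]),
                    tweetsPerSecond(TWEETS[i], SECONDS[i])));
        }
    }
}
```

The per-tweet time stays between roughly 0.54 and 0.58 ms across all ten columns, so the cost grows about linearly with the number of tweets.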

Benchmark test by KoNLPy

From http://konlpy.org/ko/v0.4.3/morph/#pos-tagging-with-konlpy

Author

Admin Staff

License

Copyright 2014 Twitter, Inc.

Licensed under the Apache License, Version 2.0: http://www.apache.org/licenses/LICENSE-2.0