twitter-korean-text
An open-source Korean text processor made by Twitter.
- As of early 2017, development continues in an official fork at http://openkoreantext.org. All development after version 4.4 will be done in open-korean-text.
twitter-korean-text is a Korean text processor written in Scala, with a Java wrapper for use from Java. It supports four features: normalization, tokenization, stemming, and phrase extraction. It is not limited to short tweets and can process long texts as well.
The goal of twitter-korean-text is to extract index terms from sources such as big data through simple Korean processing. It does not aim for complete morphological analysis.
To take part in development, please join our community at Google Forum. Everyone is welcome, from beginners who want to learn how to use the library to those who want to contribute code.
์ ๊ทํ normalization (์ ๋๋ผใ ใ -> ์ ๋๋ค ใ ใ , ์ค๋ฆํด -> ์ฌ๋ํด)
- ํ๊ตญ์ด๋ฅผ ์ฒ๋ฆฌํ๋ ์์์ ๋๋ผใ ใ ใ ใ ใ -> ํ๊ตญ์ด๋ฅผ ์ฒ๋ฆฌํ๋ ์์์ ๋๋ค ใ ใ
ํ ํฐํ tokenization
- ํ๊ตญ์ด๋ฅผ ์ฒ๋ฆฌํ๋ ์์์ ๋๋ค ใ ใ -> ํ๊ตญ์ดNoun, ๋ฅผJosa, ์ฒ๋ฆฌNoun, ํ๋Verb, ์์Noun, ์ Adjective, ๋๋คEomi ใ ใ KoreanParticle
์ด๊ทผํ stemming (์ ๋๋ค -> ์ด๋ค)
- ํ๊ตญ์ด๋ฅผ ์ฒ๋ฆฌํ๋ ์์์ ๋๋ค ใ ใ -> ํ๊ตญ์ดNoun, ๋ฅผJosa, ์ฒ๋ฆฌNoun, ํ๋คVerb, ์์Noun, ์ด๋คAdjective, ใ ใ KoreanParticle
์ด๊ตฌ ์ถ์ถ phrase extraction
- ํ๊ตญ์ด๋ฅผ ์ฒ๋ฆฌํ๋ ์์์ ๋๋ค ใ ใ -> ํ๊ตญ์ด, ์ฒ๋ฆฌ, ์์, ์ฒ๋ฆฌํ๋ ์์
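To make the normalization example more concrete, here is a minimal Java sketch of one small part of it: shortening a long run of a repeated character, as in ㅋㅋㅋㅋㅋ -> ㅋㅋ. This is a toy illustration under stated assumptions, not the library's normalizer. The class name RepeatCollapseSketch, the method collapseRepeats, and the "three or more becomes two" rule are all hypothetical, and the sketch does not handle cases such as 입니닼 -> 입니다. The Hangul letter is written as a Unicode escape so the file compiles under any source encoding.

```java
import java.util.regex.Pattern;

// Toy illustration only; this is NOT the twitter-korean-text normalizer.
final class RepeatCollapseSketch {
    // Matches any character followed by two or more copies of itself.
    private static final Pattern REPEATS = Pattern.compile("(.)\\1{2,}");

    // Collapse every run of three or more identical characters down to two.
    static String collapseRepeats(String text) {
        return REPEATS.matcher(text).replaceAll("$1$1");
    }

    public static void main(String[] args) {
        // Five copies of U+314B, the Hangul letter used for laughter.
        String laughter = "\u314B\u314B\u314B\u314B\u314B";
        System.out.println(collapseRepeats(laughter).length()); // prints 2
    }
}
```

Runs of one or two identical characters are left untouched, so ordinary doubled letters survive.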
Introductory Presentation: Google Slides
Try it here
Gunja Agrawal kindly created a test API webpage for this project: http://gunjaagrawal.com/langhack/
Opensourced here: twitter-korean-tokenizer-api
API
Maven
To include this in your Maven-based JVM project, add the following lines to your pom.xml:
<dependency>
  <groupId>com.twitter.penguin</groupId>
  <artifactId>korean-text</artifactId>
  <version>4.4</version>
</dependency>
The Maven site is available at http://twitter.github.io/twitter-korean-text/ and the Scaladocs are at http://twitter.github.io/twitter-korean-text/scaladocs/
Support for other languages.
.NET
modamoda kindly offered a .NET wrapper: https://github.com/modamoda/TwitterKoreanProcessorCS
Node.js
Ch0p kindly offered a Node.js wrapper: twtkrjs
Youngrok Kim kindly offered a Node.js wrapper: node-twitter-korean-text
Python
Baeg-il Kim kindly offered a Python version: https://github.com/cedar101/twitter-korean-py
Jaepil Jeong kindly offered a Python wrapper: https://github.com/jaepil/twkorean
- The Python Korean NLP project KoNLPy now includes twitter-korean-text: twkorean is part of the KoNLPy package, which makes it easy to use from Python.
Ruby
jun85664396 kindly offered a Ruby wrapper: twitter-korean-text-ruby
- This provides access to com.twitter.penguin.korean.TwitterKoreanProcessorJava (Java wrapper).
Jaehyun Shin kindly offered a Ruby wrapper: twitter-korean-text-ruby
- This provides access to com.twitter.penguin.korean.TwitterKoreanProcessor (Original Scala Class).
Elasticsearch
socurites's Korean analyzer for Elasticsearch based on twitter-korean-text: tkt-elasticsearch
Get the source
Clone the Git repo and build using Maven.
git clone https://github.com/twitter/twitter-korean-text.git
cd twitter-korean-text
mvn compile
Open 'pom.xml' from your favorite IDE.
Usage
You can find these examples in the examples folder.
from Scala
import com.twitter.penguin.korean.TwitterKoreanProcessor
import com.twitter.penguin.korean.phrase_extractor.KoreanPhraseExtractor.KoreanPhrase
import com.twitter.penguin.korean.tokenizer.KoreanTokenizer.KoreanToken

object ScalaTwitterKoreanTextExample {
  def main(args: Array[String]): Unit = {
    val text = "한국어를 처리하는 예시입니닼ㅋㅋㅋㅋㅋ #한국어"

    // Normalize
    val normalized: CharSequence = TwitterKoreanProcessor.normalize(text)
    println(normalized)
    // 한국어를 처리하는 예시입니다ㅋㅋ #한국어

    // Tokenize
    val tokens: Seq[KoreanToken] = TwitterKoreanProcessor.tokenize(normalized)
    println(tokens)
    // List(한국어(Noun: 0, 3), 를(Josa: 3, 1), (Space: 4, 1), 처리(Noun: 5, 2), 하는(Verb: 7, 2), (Space: 9, 1), 예시(Noun: 10, 2), 입니(Adjective: 12, 2), 다(Eomi: 14, 1), ㅋㅋ(KoreanParticle: 15, 2), (Space: 17, 1), #한국어(Hashtag: 18, 4))

    // Stemming
    val stemmed: Seq[KoreanToken] = TwitterKoreanProcessor.stem(tokens)
    println(stemmed)
    // List(한국어(Noun: 0, 3), 를(Josa: 3, 1), (Space: 4, 1), 처리(Noun: 5, 2), 하다(Verb: 7, 2), (Space: 9, 1), 예시(Noun: 10, 2), 이다(Adjective: 12, 3), ㅋㅋ(KoreanParticle: 15, 2), (Space: 17, 1), #한국어(Hashtag: 18, 4))

    // Phrase extraction
    val phrases: Seq[KoreanPhrase] = TwitterKoreanProcessor.extractPhrases(tokens, filterSpam = true, enableHashtags = true)
    println(phrases)
    // List(한국어(Noun: 0, 3), 처리(Noun: 5, 2), 처리하는 예시(Noun: 5, 7), 예시(Noun: 10, 2), #한국어(Hashtag: 18, 4))
  }
}
from Java
import java.util.List;

import scala.collection.Seq;

import com.twitter.penguin.korean.TwitterKoreanProcessor;
import com.twitter.penguin.korean.TwitterKoreanProcessorJava;
import com.twitter.penguin.korean.phrase_extractor.KoreanPhraseExtractor;
import com.twitter.penguin.korean.tokenizer.KoreanTokenizer;

public class JavaTwitterKoreanTextExample {
  public static void main(String[] args) {
    String text = "한국어를 처리하는 예시입니닼ㅋㅋㅋㅋㅋ #한국어";

    // Normalize
    CharSequence normalized = TwitterKoreanProcessorJava.normalize(text);
    System.out.println(normalized);
    // 한국어를 처리하는 예시입니다ㅋㅋ #한국어

    // Tokenize
    Seq<KoreanTokenizer.KoreanToken> tokens = TwitterKoreanProcessorJava.tokenize(normalized);
    System.out.println(TwitterKoreanProcessorJava.tokensToJavaStringList(tokens));
    // [한국어, 를, 처리, 하는, 예시, 입니, 다, ㅋㅋ, #한국어]
    System.out.println(TwitterKoreanProcessorJava.tokensToJavaKoreanTokenList(tokens));
    // [한국어(Noun: 0, 3), 를(Josa: 3, 1), (Space: 4, 1), 처리(Noun: 5, 2), 하는(Verb: 7, 2), (Space: 9, 1), 예시(Noun: 10, 2), 입니(Adjective: 12, 2), 다(Eomi: 14, 1), ㅋㅋ(KoreanParticle: 15, 2), (Space: 17, 1), #한국어(Hashtag: 18, 4)]

    // Stemming
    Seq<KoreanTokenizer.KoreanToken> stemmed = TwitterKoreanProcessorJava.stem(tokens);
    System.out.println(TwitterKoreanProcessorJava.tokensToJavaStringList(stemmed));
    // [한국어, 를, 처리, 하다, 예시, 이다, ㅋㅋ, #한국어]
    System.out.println(TwitterKoreanProcessorJava.tokensToJavaKoreanTokenList(stemmed));
    // [한국어(Noun: 0, 3), 를(Josa: 3, 1), (Space: 4, 1), 처리(Noun: 5, 2), 하다(Verb: 7, 2), (Space: 9, 1), 예시(Noun: 10, 2), 이다(Adjective: 12, 3), ㅋㅋ(KoreanParticle: 15, 2), (Space: 17, 1), #한국어(Hashtag: 18, 4)]

    // Phrase extraction
    List<KoreanPhraseExtractor.KoreanPhrase> phrases = TwitterKoreanProcessorJava.extractPhrases(tokens, true, true);
    System.out.println(phrases);
    // [한국어(Noun: 0, 3), 처리(Noun: 5, 2), 처리하는 예시(Noun: 5, 7), 예시(Noun: 10, 2), #한국어(Hashtag: 18, 4)]
  }
}
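The two numbers printed with each token in the examples above appear to be a character offset and a length into the normalized text. That reading is inferred from the example output, not taken from the library's documentation. Under that assumption, the following minimal Java sketch recovers the stretch of text each token covers. The class TokenSpanSketch and the method surface are hypothetical names and are not part of twitter-korean-text. The sample string is the normalized text 한국어를 처리하는 예시입니다ㅋㅋ #한국어, written with Unicode escapes so the file compiles under any source encoding.

```java
// Sketch only: slices the input text by the (offset, length) pairs
// shown in the printed tokens. It does not call twitter-korean-text.
final class TokenSpanSketch {
    // The normalized sample text from the usage examples (22 characters).
    static final String TEXT =
        "\uD55C\uAD6D\uC5B4\uB97C \uCC98\uB9AC\uD558\uB294 "
        + "\uC608\uC2DC\uC785\uB2C8\uB2E4\u314B\u314B #\uD55C\uAD6D\uC5B4";

    // Return the substring that starts at offset and spans length chars.
    static String surface(String text, int offset, int length) {
        return text.substring(offset, offset + length);
    }

    public static void main(String[] args) {
        System.out.println(surface(TEXT, 5, 2));   // text under the Noun token at (5, 2)
        System.out.println(surface(TEXT, 12, 3));  // text under the stemmed Adjective at (12, 3)
        System.out.println(surface(TEXT, 18, 4));  // text under the Hashtag token at (18, 4)
    }
}
```

With this reading, the stemmed token 이다(Adjective: 12, 3) still spans the three input characters 입니다, which suggests that the offsets point at the input text rather than at the stem.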
Basics
TwitterKoreanProcessor.scala is the central object that provides the interface for all the features.
Running Tests
mvn test will run all of our unit tests.
Tools
We provide tools for quality assurance and test resources. They can be found under src/main/scala/com/twitter/penguin/korean/qa and src/main/scala/com/twitter/penguin/korean/tools.
Contribution
Refer to the general contribution guide. We will add a project-specific contribution guide later.
Detailed guide to installing and modifying the project
Performance
Tested on Intel i7 2.3 GHz
Initial loading time: 2~4 sec
Average time to parse one chunk (eojeol, a space-delimited word unit): 0.12 ms
Tweets (average length ~50 characters)
Tweets | 100K | 200K | 300K | 400K | 500K | 600K | 700K | 800K | 900K | 1M |
---|---|---|---|---|---|---|---|---|---|---|
Time in seconds | 57.59 | 112.09 | 165.05 | 218.11 | 270.54 | 328.52 | 381.09 | 439.71 | 492.94 | 542.12 |
Average per tweet: 0.54212 ms
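The per-tweet average follows directly from the last column of the table: 542.12 seconds for 1M tweets. As a small worked check, this Java sketch re-derives it and also converts it to a throughput figure. ThroughputSketch and its methods are hypothetical helper names used only for illustration, and the inputs are the table's own numbers.

```java
// Sketch only: re-derives the per-tweet average from the benchmark table.
final class ThroughputSketch {
    // Average milliseconds per item, given total elapsed seconds and item count.
    static double averageMillis(double totalSeconds, long count) {
        return totalSeconds * 1000.0 / count;
    }

    // Items processed per second.
    static double perSecond(double totalSeconds, long count) {
        return count / totalSeconds;
    }

    public static void main(String[] args) {
        // 1M tweets took 542.12 seconds in the table above.
        System.out.printf("%.5f ms per tweet%n", averageMillis(542.12, 1_000_000L));
        System.out.printf("%.0f tweets per second%n", perSecond(542.12, 1_000_000L));
    }
}
```

The same two helpers can be applied to any other column, for example 57.59 seconds for 100K tweets, to see how the rate changes as the batch grows.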
Benchmark test by KoNLPy
From http://konlpy.org/ko/v0.4.2/morph/
Author(s)
- Will Hohyon Ryu (유호현): https://github.com/nlpenguin | https://twitter.com/NLPenguin
License
Copyright 2014 Twitter, Inc.
Licensed under the Apache License, Version 2.0: http://www.apache.org/licenses/LICENSE-2.0