
Pecab


Pecab is a pure Python Korean morpheme analyzer based on Mecab. Mecab is a CRF-based morpheme analyzer made by Taku Kudo in 2011. It is both very fast and accurate, which is why it remains popular even though it is quite old. However, it is also known as one of the trickiest libraries to install, and in fact many people have had a hard time installing Mecab.

For a few years I had wanted to make a pure Python version of Mecab that was easy to install while inheriting Mecab's advantages, and Pecab is the result. It produces results very similar to Mecab's while being easy to install. For more details, please refer to the following.

Installation

pip install pecab

Usages

The user API of Pecab is inspired by KoNLPy, one of the most widely used natural language processing packages in South Korea.

1) PeCab(): creates a PeCab object.

from pecab import PeCab

pecab = PeCab()

2) morphs(text): splits text into morphemes.

pecab.morphs("์•„๋ฒ„์ง€๊ฐ€๋ฐฉ์—๋“ค์–ด๊ฐ€์‹œ๋‹ค")
['์•„๋ฒ„์ง€', '๊ฐ€', '๋ฐฉ', '์—', '๋“ค์–ด๊ฐ€', '์‹œ', '๋‹ค']

3) pos(text): returns morphemes and POS tags together.

pecab.pos("์ด๊ฒƒ์€ ๋ฌธ์žฅ์ž…๋‹ˆ๋‹ค.")
[('์ด๊ฒƒ', 'NP'), ('์€', 'JX'), ('๋ฌธ์žฅ', 'NNG'), ('์ž…๋‹ˆ๋‹ค', 'VCP+EF'), ('.', 'SF')]

4) nouns(text): returns all nouns in the input text.

pecab.nouns("์ž์žฅ๋ฉด์„ ๋จน์„๊นŒ? ์งฌ๋ฝ•์„ ๋จน์„๊นŒ? ๊ทธ๊ฒƒ์ด ๊ณ ๋ฏผ์ด๋กœ๋‹ค.")
["์ž์žฅ๋ฉด", "์งฌ๋ฝ•", "๊ทธ๊ฒƒ", "๊ณ ๋ฏผ"]

5) PeCab(user_dict=List[str]): applies a user dictionary.

Note that words included in the user dictionary cannot contain spaces.

  • Without user_dict
from pecab import PeCab

pecab = PeCab()
pecab.pos("์ €๋Š” ์‚ผ์„ฑ๋””์ง€ํ„ธํ”„๋ผ์ž์—์„œ ์ง€ํŽ ๋ƒ‰์žฅ๊ณ ๋ฅผ ์ƒ€์–ด์š”.")
[('์ €', 'NP'), ('๋Š”', 'JX'), ('์‚ผ์„ฑ', 'NNP'), ('๋””์ง€ํ„ธ', 'NNP'), ('ํ”„๋ผ์ž', 'NNP'), ('์—์„œ', 'JKB'), ('์ง€', 'NNP'), ('ํŽ ', 'NNP'), ('๋ƒ‰์žฅ๊ณ ', 'NNG'), ('๋ฅผ', 'JKO'), ('์ƒ€', 'VV+EP'), ('์–ด์š”', 'EF'), ('.', 'SF')]
  • With user_dict
from pecab import PeCab

user_dict = ["์‚ผ์„ฑ๋””์ง€ํ„ธํ”„๋ผ์ž", "์ง€ํŽ ๋ƒ‰์žฅ๊ณ "]
pecab = PeCab(user_dict=user_dict)
pecab.pos("์ €๋Š” ์‚ผ์„ฑ๋””์ง€ํ„ธํ”„๋ผ์ž์—์„œ ์ง€ํŽ ๋ƒ‰์žฅ๊ณ ๋ฅผ ์ƒ€์–ด์š”.")
[('์ €', 'NP'), ('๋Š”', 'JX'), ('์‚ผ์„ฑ๋””์ง€ํ„ธํ”„๋ผ์ž', 'NNG'), ('์—์„œ', 'JKB'), ('์ง€ํŽ ๋ƒ‰์žฅ๊ณ ', 'NNG'), ('๋ฅผ', 'JKO'), ('์ƒ€', 'VV+EP'), ('์–ด์š”', 'EF'), ('.', 'SF')]

6) PeCab(split_compound=bool): divides compound words into smaller pieces.

from pecab import PeCab

pecab = PeCab(split_compound=True)
pecab.morphs("๊ฐ€๋ฒผ์šด ๋ƒ‰์žฅ๊ณ ๋ฅผ ์ƒ€์–ด์š”.")
['๊ฐ€๋ณ', 'แ†ซ', '๋ƒ‰์žฅ', '๊ณ ', '๋ฅผ', '์‚ฌ', 'ใ…ใ…†', '์–ด์š”', '.']

7) ANY_PECAB_FUNCTION(text, drop_space=bool): determines whether spaces are returned or not.

This can be used with all of morphs, pos, and nouns. Its default value is True.

from pecab import PeCab

pecab = PeCab()
pecab.pos("ํ† ๋ผ์ •์—์„œ ํฌ๋ฆผ ์šฐ๋™์„ ์‹œ์ผฐ์–ด์š”.")
[('ํ† ๋ผ', 'NNG'), ('์ •', 'NNG'), ('์—์„œ', 'JKB'), ('ํฌ๋ฆผ', 'NNG'), ('์šฐ๋™', 'NNG'), ('์„', 'JKO'), ('์‹œ์ผฐ', 'VV+EP'), ('์–ด์š”', 'EF'), ('.', 'SF')]

pecab.pos("ํ† ๋ผ์ •์—์„œ ํฌ๋ฆผ ์šฐ๋™์„ ์‹œ์ผฐ์–ด์š”.", drop_space=False)
[('ํ† ๋ผ', 'NNG'), ('์ •', 'NNG'), ('์—์„œ', 'JKB'), (' ', 'SP'), ('ํฌ๋ฆผ', 'NNG'), (' ', 'SP'), ('์šฐ๋™', 'NNG'), ('์„', 'JKO'), (' ', 'SP'), ('์‹œ์ผฐ', 'VV+EP'), ('์–ด์š”', 'EF'), ('.', 'SF')]

Implementation Details

In fact, there was already a pure Python Korean morpheme analyzer before: Pynori. I had been using Pynori happily, and a big thank you goes to its developer. However, Pynori had some problems that needed improvement, so I started building Pecab from its codebase, focusing on solving those problems.

1) 50 ~ 100 times faster loading and lower memory usage

When a Pynori object is created, it reads the matrix and vocabulary files from disk and builds a Trie at runtime. This is quite a heavy task; in fact, when I ran Pynori for the first time, my computer froze for almost 10 seconds. I solved this with two key ideas: 1) zero-copy memory mapping and 2) a double-array trie system.

The first key idea was zero-copy memory mapping, which allows data on disk to be used as-is through virtual memory, almost without copying it into RAM. In fact, Pynori takes close to 5 seconds just to load the mecab_csv.pkl file into memory, which is a very heavy burden. I designed the matrix file to be saved with numpy.memmap and the vocabulary as a memory-mappable pyarrow.Table.
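To illustrate the zero-copy idea (this is a minimal sketch, not Pecab's actual file layout; the file name and matrix shape below are made up), a matrix written to disk once can later be mapped instead of loaded, so the operating system pages data in lazily, only when an element is touched:

```python
import os
import tempfile

import numpy as np

# build a small cost matrix once and write it to disk
# (file name and shape are hypothetical, for illustration only)
path = os.path.join(tempfile.gettempdir(), "matrix_demo.bin")
np.arange(12, dtype=np.int16).reshape(3, 4).tofile(path)

# later runs map the file instead of copying it into RAM;
# pages are faulted in only when their elements are accessed
matrix = np.memmap(path, dtype=np.int16, mode="r", shape=(3, 4))
cost = matrix[1, 2]  # touches just one page of the file
```

Because the mapped file does not reside in the process heap, start-up cost is almost independent of the file size.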

However, there was one problem with this design. The Trie data structure used in Pynori is quite difficult to store in memmap form: numpy really only handles arrays and matrices well, and pyarrow mostly handles tables. I therefore initially wanted to use a table instead of a trie, but indexing a particular key in a table takes linear time O(n), so searching could actually become much slower than before.

So the second key idea was the Double Array Trie (DATrie). Unlike general tries, a DATrie consists of just two simple integer arrays (base and check) instead of a complex node-based structure, and all keys can be retrieved from them. These two arrays are super easy to store as memmap files, which made the DATrie one of the best options for me. I wanted to implement everything in Python to keep installation simple, but unfortunately I could not find a DATrie implementation in pure Python, so I wrote one myself; you can find the implementation here.
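A toy sketch of the base/check mechanics (this is an illustration of the general technique, not the pydatrie implementation): moving from node s on character code c lands at slot t = base[s] + c, and the move is valid only when check[t] == s, so lookup is a few array reads per character.

```python
class ToyDATrie:
    """Toy double-array trie: two flat int arrays instead of linked nodes."""

    def __init__(self, keys, size=1024):
        # character codes start at 1; code 0 marks end-of-key
        chars = sorted({c for k in keys for c in k})
        self.code = {c: i + 1 for i, c in enumerate(chars)}
        self.base = [0] * size
        self.check = [-1] * size
        self.used = [False] * size
        self.used[0] = True  # node 0 is the root
        self._build(0, list(keys))

    def _build(self, node, keys):
        # group remaining suffixes by their first character code
        groups = {}
        for k in keys:
            c = self.code[k[0]] if k else 0
            groups.setdefault(c, []).append(k[1:])
        # pick the smallest base whose target slots are all free
        b = 1
        while any(self.used[b + c] for c in groups):
            b += 1
        self.base[node] = b
        for c in groups:
            self.used[b + c] = True
            self.check[b + c] = node  # slot b+c belongs to this node
        for c, rest in groups.items():
            if c != 0:  # code 0 is the terminal slot, nothing below it
                self._build(b + c, rest)

    def __contains__(self, key):
        node = 0
        for ch in key:
            c = self.code.get(ch)
            if c is None:
                return False
            t = self.base[node] + c
            if self.check[t] != node:
                return False
            node = t
        # a stored key ends in the node's terminal (code 0) slot
        return self.check[self.base[node]] == node
```

Since base and check are plain integer arrays, persisting them is exactly the kind of data numpy.memmap handles natively.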

In conclusion, reading these two files now takes roughly 50 ~ 100 times less time than before, and memory consumption was also significantly reduced because the data does not actually reside in memory.

2) User-friendly and pythonic API

Another difficulty I had while using Pynori was its user API. It has fairly Java-like APIs and expressions, and using it requires passing many parameters when creating the main object. I wanted Pecab to be very easy to use, like Mecab, without requiring users to parse the output themselves. After thinking about the API, I decided on one similar to KoNLPy, which users are already familiar with. I believe these APIs are much more user-friendly and make the library easier to use.

License

The Pecab project is licensed under the terms of the Apache License 2.0.

Copyright 2022 Hyunwoong Ko.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
