  • Stars: 213
  • Rank: 185,410 (top 4%)
  • Language: Python
  • License: MIT License
  • Created: over 6 years ago
  • Updated: 8 months ago


Repository Details

🌿 An easy-to-use Japanese text processing tool that makes it possible to switch tokenizers with small changes of code.

🌿 Konoha: Simple wrapper of Japanese Tokenizers


Konoha is a Python library that provides an easy-to-use, integrated interface to various Japanese tokenizers, which lets you switch tokenizers and speed up your pre-processing.

Supported tokenizers

Konoha integrates tokenizers such as MeCab, Sudachi, and SentencePiece (see the Example section below). It also provides rule-based tokenizers (whitespace, character) and a rule-based sentence splitter.
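A minimal sketch of the rule-based tokenizers, assuming they are selected by name in the same way as the integrated tokenizers shown in the Example section (the exact names and outputs may differ by version):

from konoha import WordTokenizer

# Whitespace tokenizer: splits the input on whitespace characters.
tokenizer = WordTokenizer("whitespace")
print(tokenizer.tokenize("natural language processing"))
# => [natural, language, processing]

# Character tokenizer: splits the input into single characters.
tokenizer = WordTokenizer("character")
print(tokenizer.tokenize("自然言語処理"))
# => [自, 然, 言, 語, 処, 理]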

Quick Start with Docker

Simply run the following on your computer:

docker run --rm -p 8000:8000 -t himkt/konoha  # from DockerHub

Or you can build the image on your machine:

git clone https://github.com/himkt/konoha  # download konoha
cd konoha && docker-compose up --build  # build and launch container

Tokenization is performed by posting a JSON object to localhost:8000/api/v1/tokenize. You can also tokenize in batches by passing texts: ["１つ目の入力", "２つ目の入力"] to localhost:8000/api/v1/batch_tokenize.

(API documentation is available at localhost:8000/redoc; you can view it in your web browser.)

Send a request using curl from your terminal. Note that the endpoint path changed in v4.6.4; please check the release note (https://github.com/himkt/konoha/releases/tag/v4.6.4).

$ curl localhost:8000/api/v1/tokenize -X POST -H "Content-Type: application/json" \
    -d '{"tokenizer": "mecab", "text": "ใ“ใ‚Œใฏใƒšใƒณใงใ™"}'

{
  "tokens": [
    [
      {
        "surface": "ใ“ใ‚Œ",
        "part_of_speech": "ๅ่ฉž"
      },
      {
        "surface": "ใฏ",
        "part_of_speech": "ๅŠฉ่ฉž"
      },
      {
        "surface": "ใƒšใƒณ",
        "part_of_speech": "ๅ่ฉž"
      },
      {
        "surface": "ใงใ™",
        "part_of_speech": "ๅŠฉๅ‹•่ฉž"
      }
    ]
  ]
}
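
For batch tokenization, a request can be sent in the same way. The following is a minimal sketch using Python's standard library; the payload follows the texts field described above, and the tokenizer field is assumed to be accepted just as for /api/v1/tokenize:

import json
from urllib import request

# POST a list of texts to the batch endpoint (host/port follow the Docker
# quick start above; the "tokenizer" field is assumed to work as for /tokenize).
payload = json.dumps({
    "tokenizer": "mecab",
    "texts": ["１つ目の入力", "２つ目の入力"],
}).encode("utf-8")

req = request.Request(
    "http://localhost:8000/api/v1/batch_tokenize",
    data=payload,
    headers={"Content-Type": "application/json"},
)

with request.urlopen(req) as response:
    print(json.loads(response.read()))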

Installation

We recommend installing konoha with pip install 'konoha[all]'.

  • Install konoha with a specific tokenizer: pip install 'konoha[(tokenizer_name)]'.
  • Install konoha with a specific tokenizer and remote file support: pip install 'konoha[(tokenizer_name),remote]'

If you only need certain tokenizers, install konoha with the corresponding extras (e.g. konoha[mecab], konoha[sudachi], etc.) or install the tokenizer packages individually.

Example

Word level tokenization

from konoha import WordTokenizer

sentence = '自然言語処理を勉強しています'

tokenizer = WordTokenizer('MeCab')
print(tokenizer.tokenize(sentence))
# => [自然, 言語, 処理, を, 勉強, し, て, い, ます]

tokenizer = WordTokenizer('Sentencepiece', model_path="data/model.spm")
print(tokenizer.tokenize(sentence))
# => [▁, 自然, 言語, 処理, を, 勉強, し, ています]

For more details, please see the example/ directory.

Remote files

Konoha supports dictionaries and models stored on cloud storage (currently Amazon S3). This requires installing konoha with the remote option; see Installation.

# download user dictionary from S3
word_tokenizer = WordTokenizer("mecab", user_dictionary_path="s3://abc/xxx.dic")
print(word_tokenizer.tokenize(sentence))

# download system dictionary from S3
word_tokenizer = WordTokenizer("mecab", system_dictionary_path="s3://abc/yyy")
print(word_tokenizer.tokenize(sentence))

# download model file from S3
word_tokenizer = WordTokenizer("sentencepiece", model_path="s3://abc/zzz.model")
print(word_tokenizer.tokenize(sentence))

Sentence level tokenization

from konoha import SentenceTokenizer

sentence = "็งใฏ็Œซใ ใ€‚ๅๅ‰ใชใ‚“ใฆใ‚‚ใฎใฏใชใ„ใ€‚ใ ใŒ๏ผŒใ€Œใ‹ใ‚ใ„ใ„ใ€‚ใใ‚Œใงๅๅˆ†ใ ใ‚ใ†ใ€ใ€‚"

tokenizer = SentenceTokenizer()
print(tokenizer.tokenize(sentence))
# => ['私は猫だ。', '名前なんてものはない。', 'だが，「かわいい。それで十分だろう」。']

You can change the symbols used for sentence splitting and for bracket expressions.

  1. sentence splitter
sentence = "私は猫だ。名前なんてものはない．だが，「かわいい。それで十分だろう」。"

tokenizer = SentenceTokenizer(period="．")
print(tokenizer.tokenize(sentence))
# => ['私は猫だ。名前なんてものはない．', 'だが，「かわいい。それで十分だろう」。']
  2. bracket expression
import re

sentence = "私は猫だ。名前なんてものはない。だが，『かわいい。それで十分だろう』。"

tokenizer = SentenceTokenizer(
    patterns=SentenceTokenizer.PATTERNS + [re.compile(r"『.*?』")],
)
print(tokenizer.tokenize(sentence))
# => ['私は猫だ。', '名前なんてものはない。', 'だが，『かわいい。それで十分だろう』。']

Test

python -m pytest

Article

Acknowledgement

The SentencePiece model used in the tests is provided by @yoheikikuta. Thanks!

More Repositories

1. awesome-bert-japanese (129 stars): 📝 A list of pre-trained BERT models for Japanese with word/subword tokenization and vocabulary construction algorithm information
2. pyner (Python, 45 stars): 🌈 Implementation of a neural-network-based named entity recognizer (Lample+, 2016) using Chainer
3. allennlp-optuna (Python, 32 stars): ⚡️ AllenNLP plugin adding subcommands to use Optuna, making hyperparameter optimization easy
4. dotfiles (Shell, 24 stars): 🔯 A collection of my rc files (tmux, neovim, zsh, fish, poetry, git, etc.) and utilities that make everyday coding fun!
5. nlp-100knock (Jupyter Notebook, 22 stars): ⚾ 2015 nlp-100knock (言語処理100本ノック): archive of my solutions
6. optuna-allennlp (Jupyter Notebook, 16 stars): 🚀 A demonstration of hyperparameter optimization using Optuna for models implemented with AllenNLP
7. allennlp-NER (Python, 15 stars): ☯️ AllenNLP training configurations for promising models on named entity recognition (BiLSTM-CRF, BiLSTM-CNN-CRF, BERT, BERT-CRF)
8. interest (TypeScript, 14 stars): 👀 Organizing papers and materials you are interested in; a serverless application powered by GitHub Pages + Google Spreadsheet
9. kirinoha (Ruby, 9 stars): Kirinoha (桐の葉): a class search system for students at the University of Tsukuba
10. sist02-cli (Ruby, 9 stars): Command-line tool to make use of sist02
11. polaris (Ruby, 8 stars): Simple sentiment analyzer based on a polarity word dictionary
12. docstring.nvim (Python, 8 stars): ✒️ Python docstring generation tool for NeoVim
13. sist02 (Ruby, 6 stars): Simple library to make citations easier
14. bulletin_bot (Python, 5 stars): A notification tool for students at the University of Tsukuba
15. ndl (Ruby, 5 stars): Gem to make use of an NLP API
16. syllabus (Ruby, 5 stars)
17. survey (3 stars)
18. keras_example (Python, 3 stars): Reimplementations
19. cargo-atcoder-vscode (TypeScript, 3 stars): 🔧 Visual Studio Code extension for AtCoder, especially for Rustaceans, based on cargo-atcoder
20. trackyou (Python, 2 stars)
21. algorithm-rs (Rust, 2 stars): 🛠️ Data structures and algorithms written in Rust
22. problems.2022 (Rust, 2 stars)
23. optuna-test-rtds (Python, 2 stars)
24. pagerank (Ruby, 2 stars): A simple library for calculating PageRank
25. fastapi-todos (Python, 2 stars)
26. studynow (Ruby, 2 stars)
27. twintter (Ruby, 2 stars): Provides a method to search a subject and broadcast it
28. dobato (Python, 2 stars)
29. optuna-allennlp-reproduction (Python, 1 star): A repo for investigating a reproducibility problem in Optuna x AllenNLP
30. atcoder-accept-count (Go, 1 star)
31. dobato-go (Go, 1 star)
32. mypy-0790-internal-error (Python, 1 star)
33. commonlitreadabilityprize (Jupyter Notebook, 1 star)
34. rblearn (Ruby, 1 star): Library for NLP with Ruby
35. acl-anthology (Python, 1 star)
36. sist02-web (Ruby, 1 star): sist02 web client
37. notebooks (Jupyter Notebook, 1 star): My notebooks for learning various models in machine learning
38. problems.2019 (C++, 1 star): Competitive programming problems and the like
39. setuppy2pyprojecttoml (Python, 1 star)
40. deeplearning4nlp (Python, 1 star)
41. omoide (Rust, 1 star): 🤖 Browse and clean tweets in the CLI
42. tweet-cleaner (Ruby, 1 star)
43. kdb2tsv (Ruby, 1 star): Creates TSV files from Excel files exported from kdb
44. optuna-e2e (Python, 1 star)
45. frontend-template (TypeScript, 1 star)
46. pysen-docs (Python, 1 star)