• Stars: 265
• Rank: 153,770 (Top 4%)
• Language: Cython
• License: Apache License 2.0
• Created: about 9 years ago
• Updated: 5 months ago

Repository Details

Japanese text normalizer for mecab-neologd

neologdn

neologdn is a Japanese text normalizer for mecab-neologd.

The normalization is based on neologd's rules: https://github.com/neologd/mecab-ipadic-neologd/wiki/Regexp.ja
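
To give a feel for the width-folding part of those rules, here is a rough standard-library sketch. It assumes Unicode NFKC normalization as a stand-in; `fold_width` is a hypothetical name, and this is not how neologdn itself is implemented. NFKC also rewrites other compatibility characters, so it only approximates the neologd rule set.

```python
import unicodedata


def fold_width(text: str) -> str:
    """Approximate width folding with NFKC (an assumption, not neologdn's rules)."""
    return unicodedata.normalize("NFKC", text)


# Half-width katakana become full-width katakana.
print(fold_width("ﾊﾝｶｸｶﾅ"))
# Full-width Latin letters and symbols become ASCII.
print(fold_width("ＰＲＭＬ！？"))
```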

Contributions are welcome!

NOTE: Installing this module requires a C++11 compiler.

Installation

$ pip install neologdn

Usage

import neologdn
neologdn.normalize("ﾊﾝｶｸｶﾅ")
# => 'ハンカクカナ'
neologdn.normalize("全角記号！？＠＃")
# => '全角記号!?@#'
neologdn.normalize("全角記号例外「・」")
# => '全角記号例外「・」'
neologdn.normalize("長音短縮ウェーーーーイ")
# => '長音短縮ウェーイ'
neologdn.normalize("チルダ削除ウェ~∼∾〜〰～イ")
# => 'チルダ削除ウェイ'
neologdn.normalize("いろんなハイフン˗֊‐‑‒–⁃⁻₋−")
# => 'いろんなハイフン-'
neologdn.normalize("　　　ＰＲＭＬ　　副　読　本　　　")
# => 'PRML副読本'
neologdn.normalize(" Natural Language Processing ")
# => 'Natural Language Processing'
neologdn.normalize("かわいいいいいいいいい", repeat=6)
# => 'かわいいいいいい'
neologdn.normalize("無駄無駄無駄無駄ァ", repeat=1)
# => '無駄ァ'
neologdn.normalize("1995〜2001年", tilde="normalize")
# => '1995~2001年'
neologdn.normalize("1995~2001年", tilde="normalize_zenkaku")
# => '1995〜2001年'
neologdn.normalize("1995〜2001年", tilde="ignore")  # Don't convert tilde
# => '1995〜2001年'
neologdn.normalize("1995〜2001年", tilde="remove")
# => '19952001年'
neologdn.normalize("1995〜2001年")  # Default parameter
# => '19952001年'
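
The `repeat` option shown above can be approximated in pure Python. The following is only a sketch under stated assumptions: `shorten_repeats` is a hypothetical helper built on a backreference regex, not neologdn's actual Cython implementation, and it may differ from the library on edge cases.

```python
import re


def shorten_repeats(text: str, repeat: int) -> str:
    """Cap consecutive repetitions of any substring at `repeat` occurrences."""
    if repeat < 1:
        return text
    # (.+?) lazily captures the shortest repeating unit; \1{n,} then requires
    # at least n further copies, i.e. more than `repeat` occurrences in total.
    pattern = re.compile(r"(.+?)\1{%d,}" % repeat)
    return pattern.sub(lambda m: m.group(1) * repeat, text)


print(shorten_repeats("無駄無駄無駄無駄ァ", repeat=1))  # prints 無駄ァ
```

Runs that are already at or below the limit are left unchanged, because the pattern only matches when there are more than `repeat` occurrences.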

Benchmark

# Sample code from
# https://github.com/neologd/mecab-ipadic-neologd/wiki/Regexp.ja#python-written-by-hideaki-t--overlast
import normalize_neologd

%timeit normalize(normalize_neologd.normalize_neologd)
# => 1 loop, best of 3: 18.3 s per loop


import neologdn
%timeit normalize(neologdn.normalize)
# => 1 loop, best of 3: 9.05 s per loop

neologdn is about 2x faster than the sample code (9.05 s versus 18.3 s per loop).

Details are described in the following notebook: https://github.com/ikegami-yukino/neologdn/blob/master/benchmark/benchmark.ipynb
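
The `normalize(...)` helper passed to `%timeit` above is defined in the notebook and is not reproduced here. Outside IPython, a comparable harness could be sketched with the standard `timeit` module. The names `apply_to_all` and `best_of` are hypothetical, and the corpus is whatever list of strings you supply; this is not the notebook's code.

```python
import timeit
from typing import Callable, List, Sequence


def apply_to_all(func: Callable[[str], str], lines: Sequence[str]) -> List[str]:
    """Run a normalizer over every line of a corpus."""
    return [func(line) for line in lines]


def best_of(func: Callable[[str], str], lines: Sequence[str], repeats: int = 3) -> float:
    """Return the fastest wall-clock time in seconds over `repeats` single runs."""
    timings = timeit.repeat(lambda: apply_to_all(func, lines), number=1, repeat=repeats)
    return min(timings)


# Stand-in normalizer so the sketch runs without neologdn installed.
sample = ["  Natural Language Processing  "] * 1000
print(f"{best_of(str.strip, sample):.6f} s")
```

To compare real normalizers, pass `neologdn.normalize` and the sample-code function in place of `str.strip`, using the same corpus for both.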

License

Apache Software License.

Contribution

Contributions are welcome! See: https://github.com/ikegami-yukino/neologdn/blob/master/.github/CONTRIBUTING.md

More Repositories

1. jaconv (Python, 289 stars): Pure-Python Japanese character interconverter for Hiragana, Katakana, Hankaku, and Zenkaku
2. dataset-list (116 stars): Lists of text corpora and more (mainly Japanese)
3. pymlask (Python, 111 stars): Emotion analyzer for Japanese text
4. oseti (Python, 90 stars): Dictionary-based sentiment analysis for Japanese
5. misc (Jupyter Notebook, 35 stars): Machine Learning / Randomized Algorithm and more
6. mozcpy (Python, 34 stars): Mozc for Python: Kana-Kanji converter
7. flati (Python, 28 stars): Flatten nested iterable object for Python (Pure-Python implementation)
8. madoka-python (C++, 25 stars): Memory-efficient Count-Min Sketch Counter (based on Madoka C++ library)
9. oll-python (C++, 22 stars): Online machine learning algorithms (based on OLL C++ library)
10. shellinford-python (C++, 22 stars): Wavelet Matrix/Tree succinct data structure for full text search (based on shellinford C++ library)
11. rakutenma-python (Python, 21 stars): Rakuten MA (Python version)
12. sengiri (Python, 21 stars): Yet another sentence-level tokenizer for Japanese text
13. python-tr (Python, 14 stars): A Pure-Python implementation of the tr algorithm
14. asa-python (Python, 11 stars): Japanese Argument Structure Analyzer (ASA) client for Python
15. mecab-as-kkc (Python, 10 stars): Converting Mozc dictionary to MeCab dictionary for Kana-Kanji conversion (KKC)
16. coding-tips (10 stars): Notes for when something slips my mind
17. zunda-python (Python, 10 stars): Zunda: Japanese Enhanced Modality Analyzer client for Python
18. jctconv (Python, 8 stars): Renamed jctconv -> jaconv. Please use jaconv
19. pytypo (Python, 7 stars): English spelling correction
20. morris_counter (Python, 5 stars): Memory-efficient probabilistic counter, namely the Morris Counter
21. udon (Python, 4 stars): Renamed udon -> pytypo. Please use pytypo
22. neologdn-java (Java, 4 stars): Japanese text normalizer for mecab-neologd
23. dotfiles (Shell, 3 stars)
24. csj-eval (Python, 3 stars): For evaluating speech recognition systems using the Corpus of Spontaneous Japanese (CSJ)
25. kpy (Python, 2 stars): Keitai (Japanese mobile phone) model name extractor in Python
26. neologd-diff (Python, 2 stars): Write diff (added/removed entries) of mecab-ipadic-neologd between 2 versions
27. ikegami-yukino.github.io (HTML, 1 star): Profile of Yukino Ikegami
28. yascikit-learn (Python, 1 star): Yet another scikit-learn
29. mecab-python-windows (C++, 1 star)
30. notebooks (Jupyter Notebook, 1 star): Jupyter notebook