

A Cython MeCab wrapper for fast, pythonic Japanese tokenization and morphological analysis.


fugashi

fugashi by Irasutoya

fugashi is a Cython wrapper for MeCab, a Japanese tokenizer and morphological analysis tool. Wheels are provided for Linux, OSX (Intel), and Win64, and UniDic is easy to install.

Issues do not need to be written in English.

Check out the interactive demo, see the blog post for background on why fugashi exists and some of the design decisions, or see this guide for a basic introduction to Japanese tokenization.

If you are on a platform for which wheels are not provided, you'll need to install MeCab first; installing it from source is recommended. If you need to build from source on Windows, @chezou's fork is recommended; see issue #44 for an explanation of the problems with the official repo.

Known platforms without wheels:

  • musl-based distros like alpine #77
  • PowerPC
  • Windows 32bit

Usage

from fugashi import Tagger

tagger = Tagger('-Owakati')
text = "麩菓子は、麩を主材料とした日本の菓子。"
tagger.parse(text)
# => '麩 菓子 は 、 麩 を 主材 料 と し た 日本 の 菓子 。'
for word in tagger(text):
    print(word, word.feature.lemma, word.pos, sep='\t')
    # "feature" is the Unidic feature data as a named tuple

Installing a Dictionary

fugashi requires a dictionary. UniDic is recommended, and two easy-to-install versions are provided.

  • unidic-lite, a slightly modified UniDic 2.1.2 (from 2013) that's relatively small
  • unidic, the latest UniDic 3.1.0, which is 770MB on disk and requires a separate download step

If you just want to make sure things work you can start with unidic-lite, but for more serious processing unidic is recommended. For production use you'll generally want to generate your own dictionary too; for details see the MeCab documentation.

To get either of these dictionaries, install them with the pip commands below:

pip install fugashi[unidic-lite]

# The full version of UniDic requires a separate download step
pip install fugashi[unidic]
python -m unidic download
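If you want to check at runtime which dictionary package is available, here is a minimal sketch (not part of fugashi itself). It assumes the `unidic` and `unidic_lite` pip packages, each of which exposes a `DICDIR` constant pointing at the dictionary directory on disk:

```python
# Locate whichever UniDic dictionary package is installed, if any.
import importlib

def find_unidic_dicdir():
    """Return the dictionary directory of an installed UniDic package, or None."""
    for name in ('unidic', 'unidic_lite'):
        try:
            mod = importlib.import_module(name)
        except ImportError:
            continue
        dicdir = getattr(mod, 'DICDIR', None)
        if dicdir is not None:
            return dicdir
    return None

print(find_unidic_dicdir())
```

fugashi normally finds an installed dictionary on its own; a path like this would only be needed if you want to point MeCab at a dictionary directory explicitly (MeCab's `-d` option).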

For more information on the different MeCab dictionaries available, see this article.

Dictionary Use

fugashi is written with the assumption that you'll use UniDic to process Japanese, but it supports arbitrary dictionaries.

If you're using a dictionary besides UniDic, you can use the GenericTagger like this:

from fugashi import GenericTagger
tagger = GenericTagger()

# parse can be used as normal
tagger.parse('something')
# features from the dictionary can be accessed by field numbers
for word in tagger('something'):
    print(word.surface, word.feature[0])

You can also create a dictionary wrapper to get feature information as a named tuple.

from fugashi import GenericTagger, create_feature_wrapper
CustomFeatures = create_feature_wrapper('CustomFeatures', 'alpha beta gamma')
tagger = GenericTagger(wrapper=CustomFeatures)
for word in tagger.parseToNodeList('something'):
    print(word.surface, word.feature.alpha)
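To see what such a wrapper amounts to, here is a conceptual sketch in plain Python (no fugashi required): a named tuple whose fields name the comma-separated feature columns a dictionary emits for each token. The field names and feature string below are hypothetical, matching the example above:

```python
from collections import namedtuple

# 'alpha beta gamma' are hypothetical field names, as in the example above.
CustomFeatures = namedtuple('CustomFeatures', 'alpha beta gamma')

# A made-up raw feature string in the CSV form MeCab dictionaries use.
raw = 'noun,common,*'
features = CustomFeatures(*raw.split(','))
print(features.alpha)  # -> 'noun'
```

Fields can then be accessed by name instead of by numeric index, which keeps downstream code readable when a dictionary has many columns.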

Citation

If you use fugashi in research, please cite this paper. You can read it at the ACL Anthology or on arXiv.

@inproceedings{mccann-2020-fugashi,
    title = "fugashi, a Tool for Tokenizing {J}apanese in Python",
    author = "McCann, Paul",
    booktitle = "Proceedings of Second Workshop for NLP Open Source Software (NLP-OSS)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.nlposs-1.7",
    pages = "44--51",
    abstract = "Recent years have seen an increase in the number of large-scale multilingual NLP projects. However, even in such projects, languages with special processing requirements are often excluded. One such language is Japanese. Japanese is written without spaces, tokenization is non-trivial, and while high quality open source tokenizers exist they can be hard to use and lack English documentation. This paper introduces fugashi, a MeCab wrapper for Python, and gives an introduction to tokenizing Japanese.",
}

Alternatives

If you have a problem with fugashi, feel free to open an issue. However, in some cases a different library may be a better fit.

  • If you don't want to deal with installing MeCab at all, try SudachiPy.
  • If you need to work with Korean, try pymecab-ko or KoNLPy.

License and Copyright Notice

fugashi is released under the terms of the MIT license. Please copy it far and wide.

fugashi is a wrapper for MeCab, and fugashi wheels include MeCab binaries. MeCab is copyrighted free software by Taku Kudo <[email protected]> and Nippon Telegraph and Telephone Corporation, and is redistributed under the BSD License.
