  • Language: C++
  • License: Apache License 2.0

Juman++ (a Morphological Analyzer Toolkit)

What is Juman++

A new morphological analyzer that considers the semantic plausibility of word sequences by using a recurrent neural network language model (RNNLM). Version 2 is more accurate than the original Juman++ and analyzes text more than 250 times faster.

Installation

System Requirements

  • OS: Linux, MacOS X or Windows.
  • Compiler: C++14 compatible
    • For example gcc 5.1+, clang 3.4+, MSVC 2017
    • We test on GCC and clang on Linux/MacOS, mingw64-gcc and MSVC2017 on Windows
  • CMake v3.1 or later
  • On Ubuntu 22.04, install the following additional packages: sudo apt install libprotobuf-dev protobuf-compiler

Read this document for CentOS and RHEL derivatives or non-CMake alternatives.

Building from a package

Download the package from Releases

Important: The download should be around 300 MB. If it is not, you have probably downloaded a source snapshot, which does not contain a model.

$ tar xf jumanpp-<version>.tar.xz # decompress the package
$ cd jumanpp-<version> # move into the directory
$ mkdir bld # make a subdirectory for build
$ cd bld
$ # Release is recommended for performance; <prefix> is where Juman++ will be installed
$ cmake .. \
  -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_INSTALL_PREFIX=<prefix>
$ make install -j<parallelism>

Building from git

Important: Only the package distribution contains a pretrained model and can be used for analysis. The current git version is not compatible with the models of 2.0-rc1 and 2.0-rc2.

$ mkdir cmake-build-dir # in-source builds are not supported
$ cd cmake-build-dir
$ cmake ..
$ make # add -j<parallelism> to build in parallel

Usage

Quick start

% echo "魅力がたっぷりと詰まっている" | jumanpp
魅力 みりょく 魅力 名詞 6 普通名詞 1 * 0 * 0 "代表表記:魅力/みりょく カテゴリ:抽象物"
が が が 助詞 9 格助詞 1 * 0 * 0 NIL
たっぷり たっぷり たっぷり 副詞 8 * 0 * 0 * 0 "自動認識"
と と と 助詞 9 格助詞 1 * 0 * 0 NIL
詰まって つまって 詰まる 動詞 2 * 0 子音動詞ラ行 10 タ系連用テ形 14 "代表表記:詰まる/つまる ドメイン:料理・食事 自他動詞:他:詰める/つめる"
いる いる いる 接尾辞 14 動詞性接尾辞 7 母音動詞 1 基本形 2 "代表表記:いる/いる"
EOS
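
The output above has one morpheme per line, with space-separated columns, and each sentence ends with EOS. The following is a minimal Python sketch for turning such output into records. The column names (surface, reading, baseform and so on) are assumptions inferred from the sample, not names defined by Juman++, and the sketch assumes that only the last column can contain spaces.

```python
from typing import Dict, Iterator, List

# Column names are assumptions inferred from the sample output above.
FIELDS = [
    "surface", "reading", "baseform",
    "pos", "pos_id", "subpos", "subpos_id",
    "conjtype", "conjtype_id", "conjform", "conjform_id",
    "features",
]


def parse_juman_output(text: str) -> Iterator[List[Dict[str, str]]]:
    """Yield one list of morpheme records per sentence (terminated by EOS)."""
    sentence: List[Dict[str, str]] = []
    for line in text.splitlines():
        line = line.rstrip()
        if not line:
            continue
        if line == "EOS":
            yield sentence
            sentence = []
            continue
        # Split into at most 12 columns so the quoted last column keeps its spaces.
        columns = line.split(" ", len(FIELDS) - 1)
        record = dict(zip(FIELDS, columns))
        record["features"] = record.get("features", "NIL").strip('"')
        sentence.append(record)
    if sentence:
        yield sentence  # tolerate a missing final EOS


SAMPLE = """\
魅力 みりょく 魅力 名詞 6 普通名詞 1 * 0 * 0 "代表表記:魅力/みりょく カテゴリ:抽象物"
が が が 助詞 9 格助詞 1 * 0 * 0 NIL
EOS
"""

sentences = list(parse_juman_output(SAMPLE))
```

Here `sentences[0][0]["surface"]` is the first morpheme of the first sentence. Lattice output (the -s option) uses a different layout and is not handled by this sketch.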

Main options

usage: jumanpp [options] 
  -s, --specifics              lattice format output (unsigned int [=5])
  --beam <int>                 set local beam width used in analysis (unsigned int [=5])
  -v, --version                print version
  -h, --help                   print this message
  --model <file>               specify a model location

Use --help to see more options.
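
When calling Juman++ from a script, the options above can be assembled programmatically. The sketch below is one possible Python wrapper, not an official binding: the function names `build_command` and `analyze` are hypothetical, and it assumes that the `jumanpp` binary is on PATH and that only the `--beam` and `--model` options listed above are needed.

```python
import subprocess
from typing import List, Optional


def build_command(beam: Optional[int] = None, model: Optional[str] = None) -> List[str]:
    """Assemble a jumanpp command line from the options listed above."""
    command = ["jumanpp"]  # assumes the binary is on PATH
    if beam is not None:
        command += ["--beam", str(beam)]
    if model is not None:
        command += ["--model", model]
    return command


def analyze(text: str, **options) -> str:
    """Send text to jumanpp on stdin and return its stdout as a string."""
    result = subprocess.run(
        build_command(**options),
        input=text.encode("utf-8"),  # Juman++ reads UTF-8 input
        stdout=subprocess.PIPE,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout.decode("utf-8")
```

For example, `analyze("魅力がたっぷりと詰まっている\n", beam=5)` would run the same analysis as the quick start with an explicit beam width.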

Input

Juman++ accepts only UTF-8 encoded text as input. Lines beginning with # are interpreted as comments.
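
Both input rules can be checked before text is passed to the analyzer. The following Python sketch is only an illustration (the helper name `classify_input` is hypothetical): it decodes the input strictly as UTF-8 and marks the lines that would be read as comments.

```python
from typing import List, Tuple


def classify_input(data: bytes) -> List[Tuple[str, bool]]:
    """Decode input as UTF-8 and flag lines that would be read as comments.

    Raises UnicodeDecodeError if the bytes are not valid UTF-8.
    """
    text = data.decode("utf-8")  # strict decoding: invalid bytes raise
    return [(line, line.startswith("#")) for line in text.splitlines()]


raw = "# S-ID:1\n魅力がたっぷりと詰まっている\n".encode("utf-8")
lines = classify_input(raw)
```

Text in another encoding, such as Shift_JIS or EUC-JP, would typically fail the decoding step and should be converted to UTF-8 first.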

Training Jumandic Model

A set of scripts for training the Jumandic model is available in this repository. It is possible to modify the system dictionary to add other entries to the trained model.

Attention: You need to have access to Mainichi Shinbun for the year 1995 to be able to use the Kyoto University corpus for training.

Other

DEMO

You can play around with our web demo, which displays a subset of the whole lattice. The demo still uses v1, but it will be updated to v2 soon.

Extracting diffs caused by beam configurations

You can extract the sentences for which two different beam configurations produce different analyses. The src/jumandic/jpp_jumandic_pathdiff binary (source), located relative to the compilation root, does this. The only Jumandic-specific thing here is the use of code-generated linear model inference.

Use the binary as jpp_jumandic_pathdiff <model> <input> > <output>.

The output is in the partial annotation format: the full-beam results appear as the actual tags, and the trimmed-beam results are written as comments.

Example:

# scores: -0.602687 -1.20004
# 子がい        pos:名詞        subpos:普通名詞 <------- trimmed beam result
# S-ID:w201007-0080605751-6 COUNT:2
熊本選抜にはマリノス、アントラーズのユースに行く
        子      pos:名詞        subpos:普通名詞 <------- full beam result
        が      pos:助詞        subpos:格助詞
        い      baseform:いる   conjtype:母音動詞       pos:動詞        conjform:基本連用形
ます
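
To post-process such diffs, the score comment and the tagged lines can be read with a few lines of Python. This is a rough sketch based only on the example above, not on a format specification: it assumes that the scores comment holds whitespace-separated floats, that tagged lines consist of a surface form followed by key:value fields, and that the arrows in the example are explanatory notes rather than real output. The function names are hypothetical.

```python
from typing import Dict, Tuple


def parse_scores(line: str) -> Tuple[float, ...]:
    """Parse a '# scores: a b' comment line into a tuple of floats."""
    prefix = "# scores:"
    if not line.startswith(prefix):
        raise ValueError("not a scores line: " + line)
    return tuple(float(value) for value in line[len(prefix):].split())


def parse_tagged_line(line: str) -> Tuple[str, Dict[str, str]]:
    """Split an indented 'surface key:value ...' line into its parts."""
    surface, *fields = line.split()
    # Keep only key:value fields; anything without a colon is ignored.
    tags = dict(field.split(":", 1) for field in fields if ":" in field)
    return surface, tags


scores = parse_scores("# scores: -0.602687 -1.20004")
surface, tags = parse_tagged_line(
    "\tい\tbaseform:いる\tconjtype:母音動詞\tpos:動詞\tconjform:基本連用形"
)
```

The sketch does not say which of the two scores belongs to which beam configuration, because the example alone does not make that clear.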

Partial Annotation Tool

We also have a partial annotation tool. Please see https://github.com/eiennohito/nlp-tools-demo for details.

Performance Notes

To get the best performance, you need to build with extended instruction sets. If you are planning to use Juman++ only locally, specify -DCMAKE_CXX_FLAGS="-march=native".

Works best on Intel Haswell and newer processors (because of FMA and BMI instruction set extensions).
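
On Linux, you can check whether a machine offers these extensions before choosing build flags. The Python sketch below is an illustration under stated assumptions: it relies on the flags line of /proc/cpuinfo (Linux only), and the choice to look for fma, bmi1 and bmi2 is an assumption, since the text above does not say which BMI variant matters.

```python
from typing import Set

# Flag names as they appear on the "flags" line of Linux /proc/cpuinfo.
# Checking both bmi1 and bmi2 is an assumption made for this sketch.
WANTED = {"fma", "bmi1", "bmi2"}


def missing_extensions(cpuinfo_text: str) -> Set[str]:
    """Return the wanted flags that the first 'flags' line does not list."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            present = set(line.split(":", 1)[1].split())
            return WANTED - present
    return set(WANTED)  # no flags line found: treat everything as missing


sample = "processor\t: 0\nflags\t\t: fpu sse2 avx2 fma bmi1 bmi2\n"
missing = missing_extensions(sample)
```

On a real machine, call it as `missing_extensions(open("/proc/cpuinfo").read())`; an empty result suggests that building with -march=native can use these extensions.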

Using Juman++ to create your own Morphological Analyzer

Juman++ is a general tool. It does not depend on Jumandic or on the Japanese language (although some functionality is Japanese-specific). See this tutorial project, which shows how to implement something similar to T9 text input for the case where the input text has no word boundaries.

Publications and Slides

  • About the model itself: Morphological Analysis for Unsegmented Languages using Recurrent Neural Network Language Model. Hajime Morita, Daisuke Kawahara, Sadao Kurohashi. EMNLP 2015 link, bibtex.

  • V2 Improvements: Juman++ v2: A Practical and Modern Morphological Analyzer. Arseny Tolmachev and Kurohashi Sadao. The Proceedings of the Twenty-fourth Annual Meeting of the Association for Natural Language Processing. March 2018, Okayama, Japan. (pdf, slides)

  • Morphological Analysis Workshop in ANLP2018 Slides: 形態素解析システムJuman++. 河原 大輔, Arseny Tolmachev. (in Japanese) slides.

  • Juman++: A Morphological Analysis Toolkit for Scriptio Continua. Arseny Tolmachev, Daisuke Kawahara and Sadao Kurohashi. EMNLP 2018, Brussels. pdf, poster, bibtex.

  • Design and Structure of The Juman++ Morphological Analyzer Toolkit. Arseny Tolmachev, Daisuke Kawahara, Sadao Kurohashi. Journal of Natural Language Processing, (paper, bibtex).

If you use Juman++ V1 in an academic setting, please cite the first work (EMNLP 2015). If you use Juman++ V2, please cite both the first and the fourth (EMNLP 2018) papers.

Authors

  • Arseny Tolmachev <arseny at kotonoha.ws>
  • Hajime Morita <hmorita at nlp.ist.i.kyoto-u.ac.jp>
  • Daisuke Kawahara <dk at i.kyoto-u.ac.jp>
  • Sadao Kurohashi <kuro at i.kyoto-u.ac.jp>

Acknowledgement

The list of all libraries used by JUMAN++ is here.

Notice

This is a branch for the Juman++ rewrite. The original version lives in the legacy branch.
