  • Language
    Go
  • License
    MIT License
  • Created almost 10 years ago
  • Updated 7 months ago

Repository Details

Self-contained Japanese Morphological Analyzer written in pure Go

Kagome v2

Kagome is an open-source Japanese morphological analyzer written in pure Go.

Dictionaries and statistical models such as MeCab-IPADIC and UniDic (unidic-mecab) can be embedded in the binary.

Improvements over v1:

  • Dictionaries are maintained in a separate repository, and only the dictionaries you need are embedded in the binary.
  • Several APIs have been refined and new ones added.

Dictionaries

dict         | source                      | package
MeCab IPADIC | mecab-ipadic-2.7.0-20070801 | github.com/ikawaha/kagome-dict/ipa
UniDIC       | unidic-mecab-2.1.2_src      | github.com/ikawaha/kagome-dict/uni

Note: IPADIC is MeCab's so-called "standard dictionary" and tends to split text into morphological units more intuitively than UniDIC. UniDIC, in contrast, splits text into smaller units, which is useful for building metadata for full-text search. For more details, see the wiki.

Experimental Features

dict                 | source                      | package
mecab-ipadic-NEologd | mecab-ipadic-neologd        | github.com/ikawaha/kagome-ipa-neologd
Korean MeCab         | mecab-ko-dic-2.1.1-20180720 | github.com/ikawaha/kagome-dict-ko

Segmentation mode for search

Like Kuromoji, Kagome provides segmentation modes for search.

  • Normal: Regular segmentation
  • Search: Use a heuristic to do additional segmentation useful for search
  • Extended: Similar to search mode, but also splits unknown words into unigrams

Untokenized | Normal | Search | Extended
関西国際空港 | 関西国際空港 | 関西　国際　空港 | 関西　国際　空港
日本経済新聞 | 日本経済新聞 | 日本　経済　新聞 | 日本　経済　新聞
シニアソフトウェアエンジニア | シニアソフトウェアエンジニア | シニア　ソフトウェア　エンジニア | シニア　ソフトウェア　エンジニア
デジカメを買った | デジカメ　を　買っ　た | デジカメ　を　買っ　た | デ　ジ　カ　メ　を　買っ　た

Programming example

package main

import (
	"fmt"
	"strings"

	"github.com/ikawaha/kagome-dict/ipa"
	"github.com/ikawaha/kagome/v2/tokenizer"
)

func main() {
	t, err := tokenizer.New(ipa.Dict(), tokenizer.OmitBosEos())
	if err != nil {
		panic(err)
	}
	// Wakati returns only the surface forms (word segmentation).
	fmt.Println("---wakati---")
	seg := t.Wakati("すもももももももものうち")
	fmt.Println(seg)

	// Tokenize returns tokens that also carry dictionary features.
	fmt.Println("---tokenize---")
	tokens := t.Tokenize("すもももももももものうち")
	for _, token := range tokens {
		features := strings.Join(token.Features(), ",")
		fmt.Printf("%s\t%v\n", token.Surface, features)
	}
}

output:

---wakati---
[すもも も もも も もも の うち]
---tokenize---
すもも	名詞,一般,*,*,*,*,すもも,スモモ,スモモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
の	助詞,連体化,*,*,*,*,の,ノ,ノ
うち	名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ

Reference

実践：形態素解析 kagome v2 (in Japanese)

Commands

Install

  • Go

    go install github.com/ikawaha/kagome/v2@latest
  • Homebrew

    # macOS and Linux (for both AMD64 and ARM64)
    brew install ikawaha/kagome/kagome
  • Docker

    See the Docker section below.

  • Manual Install

    • For manual installation, download and extract the appropriate archive for your OS and architecture from the releases page.
    • Note that the extracted binary must be placed in an accessible directory with execution permission.

Usage

$ kagome -h
Japanese Morphological Analyzer -- github.com/ikawaha/kagome/v2
usage: kagome <command>
The commands are:
   [tokenize] - command line tokenize (*default)
   server - run tokenize server
   lattice - lattice viewer
   sentence - tiny sentence splitter
   version - show version

tokenize [-file input_file] [-dict dic_file] [-userdict user_dic_file] [-sysdict (ipa|uni)] [-simple false] [-mode (normal|search|extended)] [-split] [-json]
  -dict string
    	dict
  -file string
    	input file
  -json
    	outputs in JSON format
  -mode string
    	tokenize mode (normal|search|extended) (default "normal")
  -simple
    	display abbreviated dictionary contents
  -split
    	use tiny sentence splitter
  -sysdict string
    	system dict type (ipa|uni) (default "ipa")
  -udict string
    	user dict

Tokenize command

% # interactive/REPL mode
% kagome
すもももももももものうち
すもも	名詞,一般,*,*,*,*,すもも,スモモ,スモモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
の	助詞,連体化,*,*,*,*,の,ノ,ノ
うち	名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ
EOS
% # piped standard input
% echo "すもももももももものうち" | kagome
すもも  名詞,一般,*,*,*,*,すもも,スモモ,スモモ
も      助詞,係助詞,*,*,*,*,も,モ,モ
もも    名詞,一般,*,*,*,*,もも,モモ,モモ
も      助詞,係助詞,*,*,*,*,も,モ,モ
もも    名詞,一般,*,*,*,*,もも,モモ,モモ
の      助詞,連体化,*,*,*,*,の,ノ,ノ
うち    名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ
EOS
% # JSON output
% echo "猫" | kagome -json | jq .
[
  {
    "id": 286994,
    "start": 0,
    "end": 1,
    "surface": "猫",
    "class": "KNOWN",
    "pos": [
      "名詞",
      "一般",
      "*",
      "*"
    ],
    "base_form": "猫",
    "reading": "ネコ",
    "pronunciation": "ネコ",
    "features": [
      "名詞",
      "一般",
      "*",
      "*",
      "*",
      "*",
      "猫",
      "ネコ",
      "ネコ"
    ]
  }
]
% echo "私ははにわよわわわんわん" | kagome -json | jq -r '.[].pronunciation'
ワタシ
ワ
ハニワ
ヨ
ワ
ワ
ワンワン

Server command

API

Start a server and try to access the "/tokenize" endpoint.

% kagome server &
% curl -XPUT localhost:6060/tokenize -d'{"sentence":"すもももももももものうち", "mode":"normal"}' | jq .

Web App

GitHub Page: https://ikawaha.github.io/kagome/

Start a server and open http://localhost:6060 in a browser. (The demo application uses Graphviz to draw the lattice, so Graphviz must be installed.)

% kagome server &

Lattice command

A debugging tool for the tokenization process that outputs the lattice in Graphviz DOT format.

% kagome lattice 私は鰻 | dot -Tpng -o lattice.png

Docker

# Compatible architectures: AMD64, Arm64, Arm32 (Arm v5, v6 and v7)
docker pull ikawaha/kagome:latest

# Alternatively, you can pull from GitHub Container Registry
docker pull ghcr.io/ikawaha/kagome:latest

# Interactive/REPL mode
docker run --rm -it ikawaha/kagome:latest

# If pulling from GitHub Container Registry
docker run --rm -it ghcr.io/ikawaha/kagome:latest

# Server mode (http://localhost:6060)
docker run --rm -p 6060:6060 ikawaha/kagome:latest server

# If pulling from GitHub Container Registry
docker run --rm -p 6060:6060 ghcr.io/ikawaha/kagome:latest server

Building to WebAssembly

You can see how the WebAssembly build of Kagome works on the demo site. The source code can be found in ./sample/wasm.

Licence

MIT