Kagome is an open-source Japanese morphological analyzer written in pure Go.
Dictionaries and statistical models such as MeCab-IPADIC and UniDic (unidic-mecab) can be embedded in the binary.
Improvements over v1:
- Dictionaries are maintained in a separate repository, and only the dictionaries you need are embedded in the binary.
- Several existing APIs were refined and new ones were added.
dict | source | package |
---|---|---|
MeCab IPADIC | mecab-ipadic-2.7.0-20070801 | github.com/ikawaha/kagome-dict/ipa |
UniDIC | unidic-mecab-2.1.2_src | github.com/ikawaha/kagome-dict/uni |
Note: IPADIC is MeCab's so-called "standard dictionary" and is characterized by splitting text into morphological units more intuitively than UniDIC. In contrast, UniDIC breaks phrases into smaller units, which is useful for creating metadata for full-text search. For more details, see the wiki.
Experimental Features
dict | source | package |
---|---|---|
mecab-ipadic-NEologd | mecab-ipadic-neologd | github.com/ikawaha/kagome-ipa-neologd |
Korean MeCab | mecab-ko-dic-2.1.1-20180720 | github.com/ikawaha/kagome-dict-ko |
Like Kuromoji, Kagome provides segmentation modes for search:
- Normal: Regular segmentation
- Search: Uses a heuristic to do additional segmentation useful for search
- Extended: Similar to Search mode, but also splits unknown words into unigrams
Untokenized | Normal | Search | Extended |
---|---|---|---|
関西国際空港 | 関西国際空港 | 関西　国際　空港 | 関西　国際　空港 |
日本経済新聞 | 日本経済新聞 | 日本　経済　新聞 | 日本　経済　新聞 |
シニアソフトウェアエンジニア | シニアソフトウェアエンジニア | シニア　ソフトウェア　エンジニア | シニア　ソフトウェア　エンジニア |
デジカメを買った | デジカメ　を　買っ　た | デジカメ　を　買っ　た | デ　ジ　カ　メ　を　買っ　た |
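The last row of the table suggests that Extended mode treats デジカメ as an unknown word and breaks it into single characters. The sketch below only illustrates what "unigram" splitting means for a Japanese string; `unigrams` is a hypothetical helper written for this example and is not how Kagome implements the mode.

```go
package main

import (
	"fmt"
	"strings"
)

// unigrams splits s into one-rune pieces.
// Hypothetical helper for illustration only; it is not part of Kagome.
func unigrams(s string) []string {
	pieces := make([]string, 0, len(s))
	for _, r := range s { // ranging over a string yields runes, not bytes
		pieces = append(pieces, string(r))
	}
	return pieces
}

func main() {
	// デジカメ is the token that Extended mode breaks up in the table above.
	fmt.Println(strings.Join(unigrams("デジカメ"), " ")) // prints "デ ジ カ メ"
}
```

Splitting by rune rather than by byte matters here, because each kana occupies three bytes in UTF-8.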
package main

import (
	"fmt"
	"strings"

	"github.com/ikawaha/kagome-dict/ipa"
	"github.com/ikawaha/kagome/v2/tokenizer"
)

func main() {
	// Build a tokenizer backed by the IPA dictionary; OmitBosEos leaves the
	// BOS/EOS marker tokens out of the results.
	t, err := tokenizer.New(ipa.Dict(), tokenizer.OmitBosEos())
	if err != nil {
		panic(err)
	}
	// wakati: segment the text and return only the surface forms
	fmt.Println("---wakati---")
	seg := t.Wakati("すもももももももものうち")
	fmt.Println(seg)
	// tokenize: return tokens together with their dictionary features
	fmt.Println("---tokenize---")
	tokens := t.Tokenize("すもももももももものうち")
	for _, token := range tokens {
		features := strings.Join(token.Features(), ",")
		fmt.Printf("%s\t%v\n", token.Surface, features)
	}
}
output:
---wakati---
[すもも も もも も もも の うち]
---tokenize---
すもも	名詞,一般,*,*,*,*,すもも,スモモ,スモモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
の	助詞,連体化,*,*,*,*,の,ノ,ノ
うち	名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ
- For more examples, see the sample directory.
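Each line of the tokenize output above is a surface form, a tab, and a comma-separated feature list. As a rough sketch, assuming the nine-field IPADIC layout (four part-of-speech levels, conjugation type, conjugation form, base form, reading, pronunciation), a line can be split back into named fields as follows. The `entry` type and `parseLine` function are hypothetical names for this example; tokens with fewer features, such as unknown words, are reported as errors here rather than handled.

```go
package main

import (
	"fmt"
	"strings"
)

// entry is a hypothetical container for one parsed output line.
type entry struct {
	Surface       string
	POS           []string // four part-of-speech levels
	BaseForm      string
	Reading       string
	Pronunciation string
}

// parseLine splits "surface<TAB>f0,f1,...,f8" into an entry.
// It assumes the nine-field IPADIC feature layout.
func parseLine(line string) (entry, error) {
	surface, rest, ok := strings.Cut(line, "\t")
	if !ok {
		return entry{}, fmt.Errorf("no tab separator in %q", line)
	}
	f := strings.Split(rest, ",")
	if len(f) < 9 {
		return entry{}, fmt.Errorf("expected 9 features, got %d", len(f))
	}
	return entry{
		Surface:       surface,
		POS:           f[0:4],
		BaseForm:      f[6],
		Reading:       f[7],
		Pronunciation: f[8],
	}, nil
}

func main() {
	e, err := parseLine("すもも\t名詞,一般,*,*,*,*,すもも,スモモ,スモモ")
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s -> base=%s reading=%s\n", e.Surface, e.BaseForm, e.Reading)
}
```

When working inside a Go program, reading `token.Surface` and `token.Features()` directly avoids this text round trip; the parser is only useful for output captured from elsewhere.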
- Go
  go install github.com/ikawaha/kagome/v2@latest
- Homebrew (macOS and Linux, for both AMD64 and ARM64)
  brew install ikawaha/kagome/kagome
- Docker
  - See the Docker section below
- Manual Install
  - For manual installation, download and extract the appropriate archived file for your OS and architecture from the releases page.
  - Note that the extracted binary must be placed in an accessible directory with execution permission.
$ kagome -h
Japanese Morphological Analyzer -- github.com/ikawaha/kagome/v2
usage: kagome <command>
The commands are:
   [tokenize] - command line tokenize (*default)
   server - run tokenize server
   lattice - lattice viewer
   sentence - tiny sentence splitter
   version - show version
tokenize [-file input_file] [-dict dic_file] [-userdict user_dic_file] [-sysdict (ipa|uni)] [-simple false] [-mode (normal|search|extended)] [-split] [-json]
  -dict string
        dict
  -file string
        input file
  -json
        outputs in JSON format
  -mode string
        tokenize mode (normal|search|extended) (default "normal")
  -simple
        display abbreviated dictionary contents
  -split
        use tiny sentence splitter
  -sysdict string
        system dict type (ipa|uni) (default "ipa")
  -udict string
        user dict
% # interactive/REPL mode
% kagome
すもももももももものうち
すもも	名詞,一般,*,*,*,*,すもも,スモモ,スモモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
の	助詞,連体化,*,*,*,*,の,ノ,ノ
うち	名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ
EOS
% # piped standard input
% echo "すもももももももものうち" | kagome
すもも	名詞,一般,*,*,*,*,すもも,スモモ,スモモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
の	助詞,連体化,*,*,*,*,の,ノ,ノ
うち	名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ
EOS
% # JSON output
% echo "猫" | kagome -json | jq .
[
  {
    "id": 286994,
    "start": 0,
    "end": 1,
    "surface": "猫",
    "class": "KNOWN",
    "pos": [
      "名詞",
      "一般",
      "*",
      "*"
    ],
    "base_form": "猫",
    "reading": "ネコ",
    "pronunciation": "ネコ",
    "features": [
      "名詞",
      "一般",
      "*",
      "*",
      "*",
      "*",
      "猫",
      "ネコ",
      "ネコ"
    ]
  }
]
% echo "私ははにわよわわわんわん" | kagome -json | jq -r '.[].pronunciation'
ワタシ
ワ
ハニワ
ヨ
ワ
ワ
ワンワン
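The JSON output can also be consumed from Go instead of jq. The following is a minimal sketch whose `Token` struct simply mirrors the field names visible in the sample output above. The struct and the `decodeTokens` helper are names chosen for this example, not types exported by Kagome, and the field types are inferred from that one sample.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Token mirrors the fields visible in the `kagome -json` sample output.
// The Go types are inferred from that sample and may need adjusting.
type Token struct {
	ID            int      `json:"id"`
	Start         int      `json:"start"`
	End           int      `json:"end"`
	Surface       string   `json:"surface"`
	Class         string   `json:"class"`
	POS           []string `json:"pos"`
	BaseForm      string   `json:"base_form"`
	Reading       string   `json:"reading"`
	Pronunciation string   `json:"pronunciation"`
	Features      []string `json:"features"`
}

// decodeTokens parses a JSON array of tokens.
func decodeTokens(data []byte) ([]Token, error) {
	var tokens []Token
	if err := json.Unmarshal(data, &tokens); err != nil {
		return nil, err
	}
	return tokens, nil
}

func main() {
	// Sample copied from the JSON output shown above.
	sample := []byte(`[{"id":286994,"start":0,"end":1,"surface":"猫","class":"KNOWN",
		"pos":["名詞","一般","*","*"],"base_form":"猫","reading":"ネコ","pronunciation":"ネコ",
		"features":["名詞","一般","*","*","*","*","猫","ネコ","ネコ"]}]`)
	tokens, err := decodeTokens(sample)
	if err != nil {
		panic(err)
	}
	// Equivalent in spirit to: jq -r '.[].pronunciation'
	for _, t := range tokens {
		fmt.Println(t.Pronunciation)
	}
}
```

In a real pipeline the bytes would come from standard input or a file rather than an embedded literal.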
API
Start a server and try to access the "/tokenize" endpoint.
% kagome server &
% curl -XPUT localhost:6060/tokenize -d'{"sentence":"すもももももももものうち", "mode":"normal"}' | jq .
Web App
GitHub Page: https://ikawaha.github.io/kagome/
Start a server and access http://localhost:6060.
(To draw a lattice, the demo application uses Graphviz, so Graphviz must be installed.)
% kagome server &
The lattice command is a debugging tool for the tokenization process. It outputs the lattice in Graphviz DOT format.
% kagome lattice 私は鰻 | dot -Tpng -o lattice.png
# Compatible architectures: AMD64, Arm64, Arm32 (Arm v5, v6 and v7)
docker pull ikawaha/kagome:latest
# Alternatively, you can pull from GitHub Container Registry
docker pull ghcr.io/ikawaha/kagome:latest
# Interactive/REPL mode
docker run --rm -it ikawaha/kagome:latest
# If pulling from GitHub Container Registry
docker run --rm -it ghcr.io/ikawaha/kagome:latest
# Server mode (http://localhost:6060)
docker run --rm -p 6060:6060 ikawaha/kagome:latest server
# If pulling from GitHub Container Registry
docker run --rm -p 6060:6060 ghcr.io/ikawaha/kagome:latest server
You can see how Kagome works as WebAssembly on the demo site.
The source code can be found in ./sample/wasm.
MIT