📝 A list of pre-trained BERT models for Japanese with word/subword tokenization + vocabulary construction algorithm information

awesome-bert-japanese

Pre-trained BERT models for Japanese differ in how they segment a sentence into words and in how they split words into subwords. The vocabulary used to split words into subwords can also be constructed in several ways.

This repository lists publicly available pre-trained BERT models for Japanese and summarizes, in a table, the word segmentation algorithm, the subword tokenization algorithm, and the vocabulary construction algorithm that each model uses.

Japanese text has no explicit word boundaries and mixes several kinds of characters, so many models apply word segmentation before tokenizing words into subwords.
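The two-stage pipeline described above (sentence to words, then words to subwords) can be sketched in plain Python. This is a minimal illustration, not the tokenizer of any model in the table: the word list stands in for the output of a word segmenter such as MeCab or Juman++, the vocabulary is a hypothetical toy set, and `wordpiece_tokenize` is an illustrative helper that follows the usual greedy longest-match-first WordPiece convention with a `##` prefix on non-initial pieces.

```python
from typing import List, Set


def wordpiece_tokenize(word: str, vocab: Set[str], unk: str = "[UNK]") -> List[str]:
    """Split one word into subwords by greedy longest-match-first lookup."""
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark non-initial pieces
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece matches at this position: the whole word becomes unknown.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


# Assumption: a word segmenter has already split the sentence
# "東京大学で自然言語処理を学ぶ" into these words.
words = ["東京", "大学", "で", "自然言語処理", "を", "学ぶ"]

# Assumption: a hypothetical toy vocabulary; real models ship tens of thousands of entries.
vocab = {"東京", "大学", "で", "自然", "##言語", "##処理", "を", "学", "##ぶ"}

tokens = [piece for word in words for piece in wordpiece_tokenize(word, vocab)]
print(tokens)
```

Models listed as "without word segmentation" skip the first stage and apply the subword model directly to the raw sentence.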

Model

| Model | Sentence -> Words | Word -> Subword | Algorithm for constructing vocabulary used in subword tokenization |
|---|---|---|---|
| Google (Multilingual BERT) | Whitespace | WordPiece | BPE? |
| Kikuta | -- | Sentencepiece (without word segmentation) | Sentencepiece (model_type=unigram) |
| Hotto Link Inc. | -- | Sentencepiece (without word segmentation) | Sentencepiece (model_type=unigram) |
| Kyoto University | Juman++ (JUMANDIC?) | WordPiece | subword-nmt (BPE) |
| Stockmark Inc. (a) | MeCab (mecab-ipadic-neologd) | -- | -- |
| Tohoku University (a) | MeCab (mecab-ipadic) | WordPiece | Sentencepiece (model_type=bpe) |
| Tohoku University (b) | MeCab (mecab-ipadic) | Character | Sentencepiece (model_type=character) |
| NICT (a) | MeCab (mecab-jumandic) | WordPiece | subword-nmt (BPE) |
| NICT (b) | MeCab (mecab-jumandic) | -- | -- |
| akirakubo (a) | MeCab (unidic-cwj) for Wikipedia and Aozora bunko written in modern kana orthography (新仮名) + MeCab (unidic_qkana) for Aozora bunko written in historical kana orthography (旧仮名) | WordPiece | subword-nmt (BPE) |
| akirakubo (b) | SudachiPy (SudachiDict_core + A mode) for Wikipedia and Aozora bunko written in modern kana orthography (新仮名) + MeCab (unidic_qkana) for Aozora bunko written in historical kana orthography (旧仮名) | WordPiece | subword-nmt (BPE) |
| The University of Tokyo | MeCab (mecab-ipadic-neologd + user dic (J-MeDic)) | WordPiece | ? (BPE) |
| Laboro.AI Inc. | -- | Sentencepiece (without word segmentation) | Sentencepiece (model_type=unigram) |
| Bandai Namco Research Inc. | MeCab (mecab-ipadic) | WordPiece | Sentencepiece (model_type=bpe) |
| Retrieva, Inc. | MeCab (mecab-ipadic) | WordPiece | Sentencepiece (model_type=bpe) |
| Waseda University | Juman++ (JUMANDIC) | WordPiece | Sentencepiece (model_type=unigram) |
| LINE Corp. | MeCab (mecab-unidic) | WordPiece | Sentencepiece (model_type=bpe) |
| Stockmark Inc. (b) | MeCab (mecab-ipadic-neologd) | WordPiece | Sentencepiece (model_type=?) |

  • NICT: National Institute of Information and Communications Technology
  • without word segmentation: the sentence is split directly into subwords, without first being segmented into words
  • For models by Tohoku University, MeCab+mecab-ipadic-neologd is used for sentence segmentation (thanks @ikuyamada san!)
  • For models by akirakubo, documents in Aozora bunko are classified into two categories according to the kana orthography in which they are written. (thanks @kkadowa san and @akirakubo san!)
  • For DistilBERT (by Bandai Namco Research Inc.), the same word segmentation and the same vocabulary construction algorithm are used for both the teacher and student models.
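
The last column of the table names the algorithm used to build the subword vocabulary. As a rough illustration of what BPE-based construction does, here is a minimal sketch that learns merge operations from word frequencies. It is a simplification under stated assumptions, not the subword-nmt or Sentencepiece implementation: the function `learn_bpe`, the toy word counts, and the omission of end-of-word markers are all choices made for this example.

```python
from collections import Counter
from typing import Dict, List, Tuple


def learn_bpe(word_freqs: Dict[str, int], num_merges: int) -> List[Tuple[str, str]]:
    """Learn BPE merges by repeatedly joining the most frequent adjacent symbol pair."""
    # Start with every word represented as a sequence of single characters.
    corpus: Dict[Tuple[str, ...], int] = {tuple(w): f for w, f in word_freqs.items()}
    merges: List[Tuple[str, str]] = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts: Counter = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite the corpus with the chosen pair merged into one symbol.
        new_corpus: Dict[Tuple[str, ...], int] = {}
        for symbols, freq in corpus.items():
            merged: List[str] = []
            i = 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_corpus[key] = new_corpus.get(key, 0) + freq
        corpus = new_corpus
    return merges


# Assumption: hypothetical word counts, as if produced by a word segmenter over a corpus.
word_freqs = {"東京": 5, "東京都": 3, "京都": 2}

merges = learn_bpe(word_freqs, num_merges=2)

# The vocabulary is the set of initial characters plus every merged symbol.
vocab = {ch for word in word_freqs for ch in word} | {a + b for a, b in merges}
print(merges)
print(sorted(vocab))
```

Because the counts come from pre-segmented words, merges never cross word boundaries. Models listed as "without word segmentation" instead train the subword model on raw sentences.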
