awesome-japanese-nlp-resources
A curated list of resources dedicated to Python libraries, pre-trained models, dictionaries, and corpora of NLP for Japanese
This list includes 489 Japanese NLP repositories.
A tool for searching these repositories is available on Hugging Face Spaces.
Your contributions are always welcome!
Please read the Contribution guidelines before contributing.
Resources that are not available on GitHub are added to the wiki.
English | 日本語 (Japanese) | 繁體中文 (Chinese) | 简体中文 (Chinese)
☝ By using ChatGPT, we were able to improve the translation results.
The latest additions 🎉
Tutorial
llm-book - GitHub repository for the book 「大規模言語模デル入門」 (Introduction to Large Language Models), published by 技術評論社 in 2023
Updated on Aug 09, 2023
🏅 bomin0624 provided some repository information. Thank you!
Contents
Python library
Morphology analysis
sudachi.rs - SudachiPy 0.6* and above are developed as Sudachi.rs.
Janome - Japanese morphological analysis engine written in pure Python
mecab-python3 - mecab-python. You can find the original version here: http://taku910.github.io/mecab/
mecab - This repository is for building Windows 64-bit MeCab binary and improving MeCab Python binding.
fugashi - A Cython MeCab wrapper for fast, pythonic Japanese tokenization and morphological analysis.
nagisa - A Japanese tokenizer based on recurrent neural networks
pyknp - A Python Module for JUMAN++/KNP
Mykytea-python - Python wrapper for KyTea
konoha - Konoha: Simple wrapper of Japanese Tokenizers
natto-py - natto-py combines the Python programming language with MeCab, the part-of-speech and morphological analyzer for the Japanese language.
rakutenma-python - Rakuten MA (Python version)
python-vaporetto - Vaporetto is a fast and lightweight pointwise prediction based tokenizer. This is a Python wrapper for Vaporetto.
dango - An easy to use tokenizer for Japanese text, aimed at language learners and non-linguists
rhoknp - Yet another Python binding for Juman++/KNP
python-vibrato - Viterbi-based accelerated tokenizer (Python wrapper)
Parsing
ginza - A Japanese NLP Library using spaCy as framework based on Universal Dependencies
cabocha - Yet Another Japanese Dependency Structure Analyzer
UniDic2UD - Tokenizer POS-tagger Lemmatizer and Dependency-parser for modern and contemporary Japanese
camphr - Camphr: NLP library for creating pipeline components
SuPar-UniDic - Tokenizer POS-tagger Lemmatizer and Dependency-parser for modern and contemporary Japanese with BERT models
depccg - A* CCG Parser with a Supertag and Dependency Factored Model
bertknp - A Japanese dependency parser based on BERT
esupar - Tokenizer POS-Tagger and Dependency-parser with BERT/RoBERTa/DeBERTa models for Japanese and other languages
yomikata - Heteronym disambiguation library using a fine-tuned BERT model.
Converter
pykakasi - Lightweight converter from Japanese Kana-kanji sentences into Kana-Roman.
cutlet - Japanese to romaji converter in Python
alphabet2kana - Convert English alphabet to Katakana
Convert-Numbers-to-Japanese - Converts Arabic numerals, or 'western' style numbers, to a Japanese context.
mozcpy - Mozc for Python: Kana-Kanji converter
jamorasep - Japanese text parser to separate Hiragana/Katakana string into morae (syllables).
text2phoneme - A script that converts Japanese sentences into phoneme sequences
jntajis-python - A fast character conversion and transliteration library based on the scheme defined for Japan National Tax Agency (国税庁) 's
Preprocessor
neologdn - Japanese text normalizer for mecab-neologd
jaconv - Pure-Python Japanese character interconverter for Hiragana, Katakana, Hankaku, and Zenkaku
mojimoji - A fast converter between Japanese hankaku and zenkaku characters
text-cleaning - A powerful text cleaner for Japanese web texts
HojiChar - A text preprocessing tool for composing and managing multiple preprocessing steps
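For readers unfamiliar with the hankaku/zenkaku (half-width/full-width) distinction that several of these preprocessors handle, here is a minimal standard-library sketch. It is not the API of any library listed above: the helper name `normalize_width` is hypothetical, and it assumes that plain Unicode NFKC normalization is an acceptable stand-in, which is coarser than what the dedicated tools offer.

```python
import unicodedata


def normalize_width(text: str) -> str:
    """Fold full-width alphanumerics to ASCII and half-width kana to full-width kana.

    Hypothetical helper for illustration only. It relies on Unicode NFKC
    normalization, which also folds other compatibility characters
    (ligatures, circled digits, and so on), so it is not a drop-in
    replacement for the libraries in this list.
    """
    return unicodedata.normalize("NFKC", text)


if __name__ == "__main__":
    print(normalize_width("ＡＢＣ１２３"))  # full-width Latin letters and digits
    print(normalize_width("ﾊﾟｿｺﾝ"))        # half-width katakana with a semi-voiced mark
```

Because NFKC rewrites every compatibility character, a dedicated converter is likely the better choice when only one direction or one character class should change.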
Sentence splitter
Bunkai - Sentence boundary disambiguation tool for Japanese texts (日本語文境界判定器)
japanese-sentence-breaker - Japanese Sentence Breaker
sengiri - Yet another sentence-level tokenizer for the Japanese text
budoux - Standalone. Small. Language-neutral. BudouX is the successor to Budou, the machine learning powered line break organizer tool.
ja_sentence_segmenter - Japanese sentence segmentation library for Python
hasami - A tool to perform sentence segmentation on Japanese text
kuzukiri - Japanese Text Segmenter for Python written in Rust
ja-senter-benchmark - Comparison of Japanese Sentence Segmentation Tools
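As a point of comparison for the tools above, the following is a naive punctuation-based splitter using only the standard library. The function name `split_sentences` and the choice of sentence-final characters (。！？) are assumptions made for this sketch; it is not how any listed library works, and it does not handle quotations, brackets, or emoticons.

```python
import re
from typing import List

# A sentence is a run of non-terminator characters followed by any
# consecutive terminators, so "！？" stays attached to its sentence.
_SENTENCE = re.compile(r"[^。！？]+[。！？]*")


def split_sentences(text: str) -> List[str]:
    """Split Japanese text on 。！？ (hypothetical helper, illustration only)."""
    return [m.strip() for m in _SENTENCE.findall(text) if m.strip()]


if __name__ == "__main__":
    for sentence in split_sentences("今日は晴れです。明日は雨ですか？"):
        print(sentence)
```

A rule this simple will split inside quoted speech such as 「…。」, which is one reason to reach for a dedicated segmenter.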
Sentiment analysis
oseti - Dictionary based Sentiment Analysis for Japanese
negapoji - Japanese negative/positive classification of documents.
pymlask - Emotion analyzer for Japanese text
asari - Japanese sentiment analyzer implemented in Python.
Machine translation
jparacrawl-finetune - An example usage of JParaCrawl pre-trained Neural Machine Translation (NMT) models.
JASS - JASS: Japanese-specific Sequence to Sequence Pre-training for Neural Machine Translation (LREC2020) & Linguistically Driven Multi-Task Pre-Training for Low-Resource Neural Machine Translation (ACM TALLIP)
PheMT - A phenomenon-wise evaluation dataset for Japanese-English machine translation robustness. The dataset is based on the MTNT dataset, with additional annotations of four linguistic phenomena; Proper Noun, Abbreviated Noun, Colloquial Expression, and Variant. COLING 2020.
VISA - An ambiguous subtitles dataset for visual scene-aware machine translation
Named entity recognition
namaco - Character Based Named Entity Recognition.
entitypedia - Entitypedia is an Extended Named Entity Dictionary from Wikipedia.
noyaki - Converts character span label information to tokenized text-based label information.
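bert-japanese-ner-finetuning - Code to fine-tune the BERT model. A sample that creates and uses a named entity recognition model by fine-tuning BERT.
joint-information-extraction-hs - Code for running inference to measure named entity and relation extraction accuracy on a case report corpus built with detailed annotation criteria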
OCR
Manga OCR - Optical character recognition for Japanese text, with the main focus being Japanese manga
mokuro - Read Japanese manga inside browser with selectable text.
handwritten-japanese-ocr - Handwritten Japanese OCR demo using touch panel to draw the input text using Intel OpenVINO toolkit
OCR_Japanease - Japanese OCR
ndlocr_cli - NDLOCR application
donut - Official Implementation of OCR-free Document Understanding Transformer (Donut) and Synthetic Document Generator (SynthDoG), ECCV 2022
JMTrans - manga translator - get japanese manga from url to translate manga image
Kindai-OCR - OCR system for recognizing modern Japanese magazines
text_recognition - Text recognition module for NDLOCR
Poricom - Optical character recognition in manga images. Manga OCR desktop application
Tool for pretrained models
Others
namedivider-python - A tool for dividing the Japanese full name into a family name and a given name.
asa-python - A curated list of resources dedicated to Python libraries of NLP for Japanese
python_asa - Python version of the Japanese semantic role labeling system (ASA)
toiro - A comparison tool of Japanese tokenizers
ja-timex - A rule-based analyzer that extracts and normalizes temporal expressions written in natural language
JapaneseTokenizers - A set of metrics for feature selection from text data
daaja - This repository has implementations of data augmentation for NLP for Japanese.
accel-brain-code - The purpose of this repository is to make prototypes as case study in the context of proof of concept(PoC) and research and development(R&D) that I have written in my website. The main research topics are Auto-Encoders in relation to the representation learning, the statistical machine learning for energy-based models, adversarial generation net…
kyoto-reader - A processor for KyotoCorpus, KWDLC, and AnnotatedFKCCorpus
nlplot - Visualization Module for Natural Language Processing
rake-ja - Rapid Automatic Keyword Extraction algorithm for Japanese
jel - Japanese Entity Linker.
MedNER-J - Latest version of MedEX/J (Japanese disease name extractor)
zunda-python - Zunda: Japanese Enhanced Modality Analyzer client for Python.
AIO2_DPR_baseline - https://www.nlp.ecei.tohoku.ac.jp/projects/aio/
showcase - A PyTorch implementation of the Japanese Predicate-Argument Structure (PAS) analyser presented in the paper of Matsubayashi & Inui (2018) with some improvements.
darts-clone-python - Darts-clone python binding
jrte-corpus_example - Example codes for Japanese Realistic Textual Entailment Corpus
desuwa - Feature annotator to morphemes and phrases based on KNP rule files (pure-Python)
HotPepperGourmetDialogue - Restaurant Search System through Dialogue in Japanese.
nlp-recipes-ja - Sample code for natural language processing in Japanese
Japanese_nlp_scripts - Small example scripts for working with Japanese texts in Python
DNorm-J - Japanese version of DNorm
pyknp-eventgraph - EventGraph is a development platform for high-level NLP applications in Japanese.
ishi - Ishi: A volition classifier for Japanese
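python-npylm - Unsupervised morphological analysis with a Bayesian hierarchical language model
python-npycrf - Semi-supervised morphological analysis that integrates conditional random fields with a Bayesian hierarchical language model
unsupervised-pos-tagging - Unsupervised part-of-speech tag estimation
negima - Negima is a Python package to extract phrases in Japanese text by using part-of-speech-based rules that you define.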
YouyakuMan - Extractive summarizer using BertSum as summarization model
japanese-numbers-python - A parser for Japanese numbers (kanji, Arabic) in natural language.
kantan - Look up Japanese words by radical patterns
make-meidai-dialogue - Get Japanese dialogue corpus
japanese_summarizer - A summarizer for Japanese articles.
chirptext - ChirpText is a collection of text processing tools for Python.
yubin - Japanese Address Munger
jawiki-cleaner - Japanese Wikipedia Cleaner
japanese2phoneme - A Python library to convert Japanese to phonemes.
anlp_nlp2021_d3-1 - This repository contains code related to the experiments in "An Experimental Evaluation of Japanese Tokenizers for Sentiment-Based Text Classification"
aozora_classification - This project aims to classify Japanese sentences by how closely they resemble the writing of classic Japanese authors such as Soseki Natsume, Ogai Mori, and Ryunosuke Akutagawa.
aozora-corpus-generator - Generates plain or tokenized text files from the Aozora Bunko
JLM - A fast LSTM Language Model for large vocabulary language like Japanese and Chinese
NTM - Testing of Neural Topic Modeling for Japanese articles
EN-JP-ML-Lexicon - An English-Japanese lexicon for Machine Learning and Deep Learning terminology.
text-generation - Easy-to-use scripts to fine-tune GPT-2-JA with your own texts, to generate sentences, and to tweet them automatically.
chainer_nic - Neural Image Caption (NIC) on Chainer, with pretrained models on English and Japanese image caption datasets.
unihan-lm - The official repository for "UnihanLM: Coarse-to-Fine Chinese-Japanese Language Model Pretraining with the Unihan Database", AACL-IJCNLP 2020
mbart-finetuning - Code to perform finetuning of the mBART model.
xvector_jtubespeech - xvector model on jtubespeech
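TinySegmenterMaker - A tool for building your own trained models for TinySegmenter.
Grongish - A script for converting between Japanese and Grongi (グロンギ語)
WordCloud-Japanese - A script that gives WordCloud a morphological-analysis-like rendering of Japanese text without using MeCab (a morphological analysis engine)
snark - A DB access library built on Japanese WordNet
toEmoji - Something that converts Japanese sentences into emoji-only sentences
termextract - A practice implementation of a technical term extraction algorithm
JDT-with-KenLM-scoring - Scores the response candidates of Japanese-Dialog-Transformer with a KenLM n-gram language model, then filters or reranks them.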
mixture-of-unigram-model - Mixture of Unigram Model and Infinite Mixture of Unigram Model in Python. (混合ユニグラムモデルと無限混合ユニグラムモデル)
hidden-markov-model - Hidden Markov Model (HMM) and Infinite Hidden Markov Model (iHMM) in Python. (隠れマルコフモデルと無限隠れマルコフモデル)
Ngram-language-model - Ngram language model in Python. (Nグラム言語モデル)
ASRDeepSpeech - Automatic Speech Recognition with deepspeech2 model in pytorch with support from Zakuro AI.
neural_ime - Neural IME: Neural Input Method Engine
neural_japanese_transliterator - Can neural networks transliterate Romaji into Japanese correctly?
tinysegmenter - Tokenizer specialized for Japanese
AugLy-jp - Data Augmentation for Japanese Text on AugLy
furigana4epub - A Python script for adding furigana to Japanese epub books using Mecab and Unidic.
PyKatsuyou - Japanese verb/adjective inflections tool
jageocoder - Pure Python Japanese address geocoder
pygeonlp - pygeonlp, A python module for geotagging Japanese texts.
nksnd - New kana-kanji conversion engine
JaMIE - A Japanese Medical Information Extraction Toolkit
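fasttext-vs-word2vec-on-twitter-data - A comparison of fastText and word2vec, with execution and training scripts
minimal-search-engine - A minimal search engine / PageRank / tf-idf
5ch-analysis - Scrapes past 5ch logs to trace words that were once popular (e.g., 香具師, orz)
tweet_extructor - A tweet downloader for the Twitter Japanese reputation analysis dataset (Twitter日本語評判分析データセット)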
japanese-word-aggregation - Aggregating Japanese words based on Juman++ and ConceptNet5.5
jinf - A Japanese inflection converter
kwja - A unified language analyzer for Japanese
mlm-scoring-transformers - Reproduced package based on Masked Language Model Scoring (ACL2020).
ClipCap-for-Japanese - [PyTorch] ClipCap for Japanese
SAT-for-Japanese - [PyTorch] Show, Attend and Tell for Japanese
cihai - Python library for CJK (Chinese, Japanese, and Korean) language dictionary
marine - MARINE : Multi-task leaRnIng-based JapaNese accent Estimation
whisper-asr-finetune - Finetuning Whisper ASR model
japanese_chatbot - A PyTorch Implementation of japanese chatbot using BERT and Transformer's decoder
radicalchar - A library for normalizing radical characters
akaza - Yet another Japanese IME for IBus/Linux
posuto - Japanese postal code data.
tacotron2-japanese - Tacotron2 implementation of Japanese
ibus-hiragana - Hiragana IME for IBus
furiganapad - Furigana Pad
chikkarpy - Japanese synonym library
ja-tokenizer-docker-py - Mecab + NEologd + Docker + Python3
JapaneseEmbeddingEval - JapaneseEmbeddingEval
gptuber-by-langchain - GPT acts as a YouTuber
shuwa - Extend GNOME On-Screen Keyboard for Input Methods
japanese-nli-model - This repository provides the code for Japanese NLI model, a fine-tuned masked language model.
tra-fugu - A tool for Japanese-English translation and English-Japanese translation by using FuguMT
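fugumt - A translation environment that uses the machine translation engine published on ぷるーふおぶこんせぷと (Proof of Concept). It can translate text entered in a form as well as PDFs.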
JaSPICE - JaSPICE: Automatic Evaluation Metric Using Predicate-Argument Structures for Image Captioning Models
Retrieval-based-Voice-Conversion-WebUI-JP-localization - jp-localization
pyopenjtalk - Python wrapper for OpenJTalk
yomigana-ebook - Make learning Japanese easier by adding readings for every kanji in the eBook
N46Whisper - Whisper based Japanese subtitle generator
C++
Morphology analysis
mecab - Yet another Japanese morphological analyzer
jumanpp - Juman++ (a Morphological Analyzer Toolkit)
kytea - The Kyoto Text Analysis Toolkit for word segmentation and pronunciation estimation, etc.
Parsing
cabocha - Yet Another Japanese Dependency Structure Analyzer
knp - A Japanese Parser
Others
jsc - Joint source channel model for Japanese Kana Kanji conversion, Chinese pinyin input and CJE mixed input.
aquaskk - An input method without morphological analysis.
mozc - Mozc - a Japanese Input Method Editor designed for multi-platform
trimatch - Trimatch: An (Exact|Prefix|Approximate) String Matching Library
resembla - Resembla: Word-based Japanese similar sentence search library
corvusskk - ▽▼ SKK-like Japanese Input Method Editor for Windows
Rust crate
Morphology analysis
lindera - A morphological analysis library.
vaporetto - Vaporetto: Very Accelerated POintwise pREdicTion based TOkenizer
goya - Japanese Morphological Analysis written in Rust
vibrato - vibrato: Viterbi-based accelerated tokenizer
yoin - A Japanese Morphological Analyzer written in pure Rust
mecab-rs - Safe Rust bindings for mecab a part-of-speech and morphological analyzer library
awabi - A morphological analyzer using mecab dictionary
Converter
wana_kana_rust - Utility library for checking and converting between Japanese characters - Hiragana, Katakana - and Romaji
unicode-jp-rs - A Rust library to convert Japanese Half-width-kana[半角カナ] and Wide-alphanumeric[全角英数] into normal ones
kana - [Mirror] CLI program for transliterating romaji text to either hiragana or katakana
Search engine library
Others
daachorse - A fast implementation of the Aho-Corasick algorithm using the compact double-array data structure in Rust.
find-simdoc - Finding all pairs of similar documents time- and memory-efficiently
crawdad - Rust library of natural language dictionaries using character-wise double-array tries.
tokenizer-speed-bench - Comparison code of various tokenizers
stringmatch-bench - Here provides benchmark tools to compare the performance of data structures for string matching.
vime - Using Vim as an input method for X11 apps
voicevox_core - The core of VOICEVOX, a free, medium-quality text-to-speech software
akaza - Yet another Japanese IME for IBus/Linux
Jotoba - A free online, self-hostable, multilang Japanese dictionary.
dvorakjp-romantable - DvorakJP Roman Table for Google Japanese Input
niinii - Japanese glossator for assisted reading of text using Ichiran
JavaScript
Morphology analysis
kuromoji.js - JavaScript implementation of Japanese morphological analyzer
rakutenma - Rakuten MA - morphological analyzer (word segmentor + PoS Tagger) for Chinese and Japanese written purely in JavaScript.
Resources
node-mecab-ya - Yet another mecab wrapper for nodejs
juman-bin - A user-extensible morphological analyzer for Japanese.
node-mecab-async - Asynchronous Japanese morphological analyser using MeCab.
Converter
kuroshiro - Japanese language library for converting Japanese sentence to Hiragana, Katakana or Romaji with furigana and okurigana modes supported.
kuroshiro-analyzer-kuromoji - Kuromoji morphological analyzer for kuroshiro.
hepburn - Node.js module for converting Japanese Hiragana and Katakana script to, and from, Romaji using Hepburn romanisation
japanese-numerals-to-number - Converts Japanese Numerals into number
jslingua - JavaScript libraries to process text: Arabic, Japanese, etc.
WanaKana - JavaScript library for detecting and transliterating Hiragana <--> Katakana <--> Romaji
node-romaji-name - Normalize and fix common issues with Romaji-based Japanese names.
kyujitai.js - Utility collections for making Japanese text old-fashioned
normalize-japanese-addresses - An open-source address normalization library.
Others
Go
Morphology analysis
kagome - Self-contained Japanese Morphological Analyzer written in pure Go
Others
ojosama - Converts text into the ojou-sama (refined young lady) speaking style of 壱百満天原サロメ
nihongo - Japanese Dictionary
yomichan-import - External dictionary importer for Yomichan.
imas-ime-dic - THE IDOLM@STER words dictionary for Japanese IME (by imas-db.jp)
go-moji - A Go library for Zenkaku/Hankaku conversion
Java
Morphology analysis
kuromoji - Kuromoji is a self-contained and very easy to use Japanese morphological analyzer designed for search
Sudachi - A Japanese Tokenizer for Business
SudachiDict - A lexicon for Sudachi
Others
kanjitomo-ocr - Java library for identifying Japanese characters from images
jakaroma - Java library and command-line tool to transliterate Japanese kanji to romaji (Latin alphabet)
kakasi-java - Kanji transliteration to hiragana/katakana/romaji, in Java
Kamite - A desktop language immersion companion for learners of Japanese
react-native-japanese-tokenizer - Async Japanese Tokenizer Native Plugin for React Native for iOS and Android
elasticsearch-analysis-japanese - Japanese analyzer uses kuromoji japanese tokenizer for ElasticSearch
moji4j - A Java library to convert between Japanese Hiragana, Katakana, and Romaji scripts.
neologdn-java - Japanese text normalizer for mecab-neologd
Pretrained model
Word2Vec
Transformer based models
bert-japanese - BERT models for Japanese text.
japanese-pretrained-models - Code for producing Japanese pretrained models provided by rinna Co., Ltd.
bert-japanese - BERT with SentencePiece for Japanese text.
SudachiTra - Japanese tokenizer for Transformers
japanese-dialog-transformers - Code for evaluating Japanese pretrained models provided by NTT Ltd.
shiba - Pytorch implementation and pre-trained Japanese model for CANINE, the efficient character-level transformer.
Dialog - A PyTorch Implementation of japanese chatbot using BERT and Transformer's decoder
language-pretraining - BERT and ELECTRA models of PyTorch implementations for Japanese text.
medbertjp - Trials of pre-trained BERT models for the medical domain in Japanese.
ILYS-aoba-chatbot - ILYS-aoba-chatbot
t5-japanese - Codes to pre-train Japanese T5 models
pytorch_bert_japanese - Use Japanese pretrained BERT models with PyTorch
Laboro-BERT-Japanese - Laboro BERT Japanese: Japanese BERT Pre-Trained With Web-Corpus
RoBERTa-japanese - Japanese BERT Pretrained Model
aMLP-japanese - aMLP Transformer Model for Japanese
bert-japanese-aozora - Japanese BERT trained on Aozora Bunko and Wikipedia, pre-tokenized by MeCab with UniDic & SudachiPy
sbert-ja - Code to train Sentence BERT Japanese model for Hugging Face Model Hub
BERT-Japan-vaccination - Official fine-tuning code for "Emotion Analysis of Japanese Tweets and Comparison to Vaccinations in Japan"
gpt2-japanese - Japanese GPT2 Generation Model
text2text-japanese - gpt-2 based text2text conversion model
gpt-ja - GPT-2 Japanese model for HuggingFace's transformers
friendly_JA-Model - MT model trained using the friendly_JA Corpus attempting to make Japanese easier/more accessible to occidental people by using the Latin/English derived katakana lexicon instead of the standard Sino-Japanese lexicon
albert-japanese - BERT with SentencePiece for Japanese text.
ja_text_bert - A repository for building a pre-trained BERT model from the Japanese Wikipedia corpus
DistilBERT-base-jp - A Japanese DistilBERT pretrained model, which was trained on Wikipedia.
bert - This repository provides snippets to use RoBERTa pre-trained on a Japanese corpus. Our dataset consists of Japanese Wikipedia and web-crawled articles, 25GB in total. The released model is built based on that from HuggingFace.
Laboro-DistilBERT-Japanese - Laboro DistilBERT Japanese
luke - LUKE -- Language Understanding with Knowledge-based Embeddings
GPTSAN - General-purpose Switch Transformer based Japanese language model
japanese-clip - Japanese CLIP by rinna Co., Ltd.
AcademicBART - We pretrained a BART-based Japanese masked language model on paper abstracts from the academic database CiNii Articles
AcademicRoBERTa - We pretrained a RoBERTa-based Japanese masked language model on paper abstracts from the academic database CiNii Articles.
LINE-DistilBERT-Japanese - DistilBERT model pre-trained on 131 GB of Japanese web text. The teacher model is a BERT-base model built in-house at LINE.
Japanese-Alpaca-LoRA - Links to Low-Rank Adapters created by fine-tuning LLaMA on the Stanford Alpaca dataset translated into Japanese, plus sample generation code
japanese-gpt-neox-3.6b - This repository provides a Japanese GPT-NeoX model of 3.6 billion parameters. The model was trained using code based on EleutherAI/gpt-neox.
japanese-gpt-neox-3.6b-instruction-sft - This repository provides a Japanese GPT-NeoX model of 3.6 billion parameters. The model is based on rinna/japanese-gpt-neox-3.6b and has been finetuned to serve as an instruction-following conversational agent.
japanese-hubert-base - This is a Japanese HuBERT (Hidden Unit Bidirectional Encoder Representations from Transformers) model trained by rinna Co., Ltd.
open-calm-7b - OpenCALM is a suite of decoder-only language models pre-trained on Japanese datasets, developed by CyberAgent, Inc.
luke-japanese-base-finetuned-ner - This model is luke-japanese-base fine-tuned for named entity recognition (NER).
albert-japanese-tinysegmenter - Pretrained models, code, and guidance to pretrain the official ALBERT (https://github.com/google-research/albert) on Japanese Wikipedia resources
japanese-mpt-7b - lightblue/japanese-mpt-7b
ChatGPT
Dictionary
Corpus
Part-of-speech tagging / Named entity recognition
Parallel corpus
Dialog corpus
JMRD - Japanese Movie Recommendation Dialogue dataset
open2ch-dialogue-corpus - A dialogue corpus built by crawling おーぷん2ちゃんねる (Open2ch)
BSD - The Business Scene Dialogue corpus
asdc - Accommodation Search Dialog Corpus (宿泊施設探索対話コーパス)
japanese-corpus - Japanese dialogue data for seq2seq etc.
BPersona-chat - This repository contains the Japanese–English bilingual chat corpus BPersona-chat published in the paper Chat Translation Error Detection for Assisting Cross-lingual Communications at AACL-IJCNLP 2022's Workshop Eval4NLP 2022.
japanese-daily-dialogue - Japanese Daily Dialogue, or 日本語日常対話コーパス in Japanese, is a high-quality multi-turn dialogue dataset containing daily conversations on five topics: dailylife, school, travel, health, and entertainment.
llm-japanese-dataset - A Japanese chat dataset for building LLMs
Others
Tutorial
Research summary
Reference
Contributors