• Stars
    star
    506
  • Rank 87,236 (Top 2 %)
  • Language
    Python
  • License
    Apache License 2.0
  • Created about 5 years ago
  • Updated 8 months ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

BERT models for Japanese text.

Pretrained Japanese BERT models

This is a repository of pretrained Japanese BERT models. The models are available in Transformers by Hugging Face.

This version of README contains information for the following models:

For information and codes for the following models, refer to the v2.0 tag of this repository:

For information and codes for the following models, refer to the v1.0 tag of this repository:

Model Architecture

The architecture of our models are the same as the original BERT models proposed by Google.

  • BERT-base models consist of 12 layers, 768 dimensions of hidden states, and 12 attention heads.
  • BERT-large models consist of 24 layers, 1024 dimensions of hidden states, and 16 attention heads.

Training Data

The models are trained on the Japanese portion of CC-100 dataset and the Japanese version of Wikipedia. For Wikipedia, we generated a text corpus from the Wikipedia Cirrussearch dump file as of January 2, 2023.

The corpus files generated from CC-100 and Wikipedia are 74.3GB and 4.9GB in size and consist of approximately 392M and 34M sentences, respectively.

For the purpose of splitting texts into sentences, we used fugashi with mecab-ipadic-NEologd dictionary (v0.0.7).

Generating corpus files

# For CC-100
$ mkdir -p $WORK_DIR/corpus/cc-100
$ python merge_split_corpora.py \
--input_files $DATA_DIR/cc-100/ja.txt.xz \
--output_dir $WORK_DIR/corpus/cc-100 \
--num_files 64

# For Wikipedia
$ mkdir -p $WORK_DIR/corpus/wikipedia
$ python make_corpus_wiki.py \
--input_file $DATA_DIR/wikipedia/cirrussearch/20230102/jawiki-20230102-cirrussearch-content.json.gz \
--output_file $WORK_DIR/corpus/wikipedia/corpus.txt.gz \
--min_sentence_length 10 \
--max_sentence_length 200 \
--mecab_option '-r <path to etc/mecabrc> -d <path to lib/mecab/dic/mecab-ipadic-neologd>'
$ python merge_split_corpora.py \
--input_files $WORK_DIR/corpus/wikipedia/corpus.txt.gz \
--output_dir $WORK_DIR/corpus/wikipedia \
--num_files 8

# Sample 1M sentences for training tokenizers
$ cat $WORK_DIR/corpus/wikipedia/corpus_*.txt|grep -a -v '^$'|shuf|head -n 10000000 > $WORK_DIR/corpus/wikipedia/corpus_sampled.txt

Tokenization

For each of BERT-base and BERT-large, we provide two models with different tokenization methods.

  • For wordpiece models, the texts are first tokenized by MeCab with the Unidic 2.1.2 dictionary and then split into subwords by the WordPiece algorithm. The vocabulary size is 32768.
  • For character models, the texts are first tokenized by MeCab with the Unidic 2.1.2 dictionary and then split into characters. The vocabulary size is 7027, which covers all the characters present in Unidic 2.1.2 dictionary.

We used unidic-lite dictionary for tokenization.

Generating a set of characters

$ mkdir -p $WORK_DIR/tokenizers/alphabet
$ python make_alphabet_from_unidic.py \
--lex_file $DATA_DIR/unidic-mecab-2.1.2_src/lex.csv \
--output_file $WORK_DIR/tokenizers/alphabet/unidic_lite.txt

Training tokenizers

# WordPiece
$ python train_tokenizer.py \
--input_files $WORK_DIR/corpus/wikipedia/corpus_sampled.txt \
--output_dir $WORK_DIR/tokenizers/wordpiece_unidic_lite \
--pre_tokenizer_type mecab \
--mecab_dic_type unidic_lite \
--vocab_size 32768 \
--limit_alphabet 7012 \
--initial_alphabet_file $WORK_DIR/tokenizers/alphabet/unidic_lite.txt \
--num_unused_tokens 10 \
--wordpieces_prefix '##'

# Character
$ mkdir $WORK_DIR/tokenizers/character_unidic_lite
$ head -n 7027 $WORK_DIR/tokenizers/wordpiece_unidic_lite/vocab.txt > $WORK_DIR/tokenizers/character_unidic_lite/vocab.txt

Generating pretraining data

# WordPiece on CC-100
# Each process takes about 2h50m and 60GB RAM, producing 15.2M instances
$ mkdir -p $WORK_DIR/pretraining_data/wordpiece_unidic_lite/cc-100
$ seq -f %02g 1 64|xargs -L 1 -I {} -P 2 \
python create_pretraining_data.py \
--input_file $WORK_DIR/corpus/cc-100/corpus_{}.txt \
--output_file $WORK_DIR/pretraining_data/wordpiece_unidic_lite/cc-100/pretraining_data_{}.tfrecord.gz \
--vocab_file $WORK_DIR/tokenizers/wordpiece_unidic_lite/vocab.txt \
--word_tokenizer_type mecab \
--subword_tokenizer_type wordpiece \
--mecab_dic_type unidic_lite \
--do_whole_word_mask \
--gzip_compress \
--use_v2_feature_names \
--max_seq_length 128 \
--max_predictions_per_seq 19 \
--masked_lm_prob 0.15 \
--dupe_factor 5

# WordPiece on Wikipedia
# Each process takes about 7h30m and 138GB RAM, producing 18.4M instances
$ mkdir -p $WORK_DIR/pretraining_data/wordpiece_unidic_lite/wikipedia
$ seq -f %02g 1 8|xargs -L 1 -I {} -P 1 \
python create_pretraining_data.py \
--input_file $WORK_DIR/corpus/wikipedia/corpus_{}.txt \
--output_file $WORK_DIR/pretraining_data/wordpiece_unidic_lite/wikipedia/pretraining_data_{}.tfrecord.gz \
--vocab_file $WORK_DIR/tokenizers/wordpiece_unidic_lite/vocab.txt \
--word_tokenizer_type mecab \
--subword_tokenizer_type wordpiece \
--mecab_dic_type unidic_lite \
--do_whole_word_mask \
--gzip_compress \
--use_v2_feature_names \
--max_seq_length 512 \
--max_predictions_per_seq 76 \
--masked_lm_prob 0.15 \
--dupe_factor 30

# Character on CC-100
# Each process takes about 3h30m and 82GB RAM, producing 18.4M instances
$ mkdir -p $WORK_DIR/pretraining_data/character_unidic_lite/cc-100
$ seq -f %02g 1 64|xargs -L 1 -I {} -P 2 \
python create_pretraining_data.py \
--input_file $WORK_DIR/corpus/cc-100/corpus_{}.txt \
--output_file $WORK_DIR/pretraining_data/character_unidic_lite/cc-100/pretraining_data_{}.tfrecord.gz \
--vocab_file $WORK_DIR/tokenizers/character_unidic_lite/vocab.txt \
--word_tokenizer_type mecab \
--subword_tokenizer_type character \
--mecab_dic_type unidic_lite \
--vocab_has_no_subword_prefix \
--do_whole_word_mask \
--gzip_compress \
--use_v2_feature_names \
--max_seq_length 128 \
--max_predictions_per_seq 19 \
--masked_lm_prob 0.15 \
--dupe_factor 5

# Character on Wikipedia
# Each process takes about 10h30m and 205GB RAM, producing 23.7M instances
$ mkdir -p $WORK_DIR/pretraining_data/character_unidic_lite/wikipedia
$ seq -f %02g 1 8|xargs -L 1 -I {} -P 1 \
python create_pretraining_data.py \
--input_file $WORK_DIR/corpus/wikipedia/corpus_{}.txt \
--output_file $WORK_DIR/pretraining_data/character_unidic_lite/wikipedia/pretraining_data_{}.tfrecord.gz \
--vocab_file $WORK_DIR/tokenizers/character_unidic_lite/vocab.txt \
--word_tokenizer_type mecab \
--subword_tokenizer_type character \
--mecab_dic_type unidic_lite \
--vocab_has_no_subword_prefix \
--do_whole_word_mask \
--gzip_compress \
--use_v2_feature_names \
--max_seq_length 512 \
--max_predictions_per_seq 76 \
--masked_lm_prob 0.15 \
--dupe_factor 30

Training

We trained the models first on the CC-100 corpus and then on the Wikipedia corpus. Generally speaking, the texts of Wikipedia are much cleaner than those of CC-100, but the amount of text is much smaller. We expect that our two-stage training scheme let the model trained on large amount of text while preserving the quality of language that the model eventually learns.

For training of the MLM (masked language modeling) objective, we introduced whole word masking in which all subword tokens corresponding to a single word (tokenized by MeCab) are masked at once.

To conduct training of each model, we used a v3-8 instance of Cloud TPUs provided by TensorFlow Research Cloud program. The training took about 16 and 56 days for BERT-base and BERT-large models, respectively.

Creating a TPU VM and connecting to it

Note: We set the runtime version of the TPU as 2.11.0, where TensorFlow v2.11 is used. It is important to specify the same version if you wish to reuse our codes, otherwise it may not work properly.

Here we use Google Cloud CLI.

$ gcloud compute tpus tpu-vm create <TPU_NODE_ID> --zone=<TPU_ZONE> --accelerator-type=v3-8 --version=tpu-vm-tf-2.11.0
$ gcloud compute tpus tpu-vm ssh <TPU_NODE_ID> --zone=<TPU_ZONE>

Training of the models

The following commands are executed in the TPU VM. It is recommended that you run the commands in a Tmux session.

Note: All the necessary files (i.e., pretraining data and config files) need to be stored in a Google Cloud Storage (GCS) bucket in advance.

BERT-base, WordPiece

(vm)$ cd /usr/share/tpu/models/
(vm)$ pip3 install -r official/requirements.txt
(vm)$ export PYTHONPATH=/usr/share/tpu/models
(vm)$ CONFIG_DIR="gs://<GCS_BUCKET_ID>/bert-japanese/configs"
(vm)$ DATA_DIR="gs://<GCS_BUCKET_ID>/bert-japanese/pretraining_data/wordpiece_unidic_lite"
(vm)$ MODEL_DIR="gs://<GCS_BUCKET_ID>/bert-japanese/model/wordpiece_unidic_lite"

# Start training on CC-100
# It will take 6 days to finish on a v3-8 TPU
(vm)$ python3 official/nlp/train.py \
--tpu=local \
--experiment=bert/pretraining \
--mode=train_and_eval \
--model_dir=$MODEL_DIR/bert_base/training/cc-100 \
--config_file=$CONFIG_DIR/data/cc-100.yaml \
--config_file=$CONFIG_DIR/model/bert_base_wordpiece.yaml \
--params_override="task.train_data.input_path=$DATA_DIR/cc-100/pretraining_data_*.tfrecord,task.validation_data.input_path=$DATA_DIR/cc-100/pretraining_data_*.tfrecord,runtime.distribution_strategy=tpu"

# Continue training on Wikipedia
# It will take 10 days to finish on a v3-8 TPU
(vm)$ python3 official/nlp/train.py \
--tpu=local \
--experiment=bert/pretraining \
--mode=train_and_eval \
--model_dir=$MODEL_DIR/bert_base/training/cc-100_wikipedia \
--config_file=$CONFIG_DIR/data/wikipedia.yaml \
--config_file=$CONFIG_DIR/model/bert_base_wordpiece.yaml \
--params_override="task.init_checkpoint=$MODEL_DIR/bert_base/training/cc-100,task.train_data.input_path=$DATA_DIR/wikipedia/pretraining_data_*.tfrecord,task.validation_data.input_path=$DATA_DIR/wikipedia/pretraining_data_*.tfrecord,runtime.distribution_strategy=tpu"

BERT-base, Character

(vm)$ cd /usr/share/tpu/models/
(vm)$ pip3 install -r official/requirements.txt
(vm)$ export PYTHONPATH=/usr/share/tpu/models
(vm)$ CONFIG_DIR="gs://<GCS_BUCKET_ID>/bert-japanese/configs"
(vm)$ DATA_DIR="gs://<GCS_BUCKET_ID>/bert-japanese/pretraining_data/character_unidic_lite"
(vm)$ MODEL_DIR="gs://<GCS_BUCKET_ID>/bert-japanese/model/character_unidic_lite"

# Start training on CC-100
# It will take 6 days to finish on a v3-8 TPU
(vm)$ python3 official/nlp/train.py \
--tpu=local \
--experiment=bert/pretraining \
--mode=train_and_eval \
--model_dir=$MODEL_DIR/bert_base/training/cc-100 \
--config_file=$CONFIG_DIR/data/cc-100.yaml \
--config_file=$CONFIG_DIR/model/bert_base_character.yaml \
--params_override="task.train_data.input_path=$DATA_DIR/cc-100/pretraining_data_*.tfrecord,task.validation_data.input_path=$DATA_DIR/cc-100/pretraining_data_*.tfrecord,runtime.distribution_strategy=tpu"

# Continue training on Wikipedia
# It will take 10 days to finish on a v3-8 TPU
(vm)$ python3 official/nlp/train.py \
--tpu=local \
--experiment=bert/pretraining \
--mode=train_and_eval \
--model_dir=$MODEL_DIR/bert_base/training/cc-100_wikipedia \
--config_file=$CONFIG_DIR/data/wikipedia.yaml \
--config_file=$CONFIG_DIR/model/bert_base_character.yaml \
--params_override="task.init_checkpoint=$MODEL_DIR/bert_base/training/cc-100,task.train_data.input_path=$DATA_DIR/wikipedia/pretraining_data_*.tfrecord,task.validation_data.input_path=$DATA_DIR/wikipedia/pretraining_data_*.tfrecord,runtime.distribution_strategy=tpu"

BERT-large, WordPiece

(vm)$ cd /usr/share/tpu/models/
(vm)$ pip3 install -r official/requirements.txt
(vm)$ export PYTHONPATH=/usr/share/tpu/models
(vm)$ CONFIG_DIR="gs://<GCS_BUCKET_ID>/bert-japanese/configs"
(vm)$ DATA_DIR="gs://<GCS_BUCKET_ID>/bert-japanese/pretraining_data/wordpiece_unidic_lite"
(vm)$ MODEL_DIR="gs://<GCS_BUCKET_ID>/bert-japanese/model/wordpiece_unidic_lite"

# Start training on CC-100
# It will take 23 days to finish on a v3-8 TPU
(vm)$ python3 official/nlp/train.py \
--tpu=local \
--experiment=bert/pretraining \
--mode=train_and_eval \
--model_dir=$MODEL_DIR/bert_large/training/cc-100 \
--config_file=$CONFIG_DIR/data/cc-100.yaml \
--config_file=$CONFIG_DIR/model/bert_large_wordpiece.yaml \
--params_override="task.train_data.input_path=$DATA_DIR/cc-100/pretraining_data_*.tfrecord,task.validation_data.input_path=$DATA_DIR/cc-100/pretraining_data_*.tfrecord,runtime.distribution_strategy=tpu"

# Continue training on Wikipedia
# It will take 33 days to finish on a v3-8 TPU
(vm)$ python3 official/nlp/train.py \
--tpu=local \
--experiment=bert/pretraining \
--mode=train_and_eval \
--model_dir=$MODEL_DIR/bert_large/training/cc-100_wikipedia \
--config_file=$CONFIG_DIR/data/wikipedia.yaml \
--config_file=$CONFIG_DIR/model/bert_large_wordpiece.yaml \
--params_override="task.init_checkpoint=$MODEL_DIR/bert_large/training/cc-100,task.train_data.input_path=$DATA_DIR/wikipedia/pretraining_data_*.tfrecord,task.validation_data.input_path=$DATA_DIR/wikipedia/pretraining_data_*.tfrecord,runtime.distribution_strategy=tpu"

BERT-large, Character

(vm)$ cd /usr/share/tpu/models/
(vm)$ pip3 install -r official/requirements.txt
(vm)$ export PYTHONPATH=/usr/share/tpu/models
(vm)$ CONFIG_DIR="gs://<GCS_BUCKET_ID>/bert-japanese/configs"
(vm)$ DATA_DIR="gs://<GCS_BUCKET_ID>/bert-japanese/pretraining_data/character_unidic_lite"
(vm)$ MODEL_DIR="gs://<GCS_BUCKET_ID>/bert-japanese/model/character_unidic_lite"

# Start training on CC-100
# It will take 23 days to finish on a v3-8 TPU
(vm)$ python3 official/nlp/train.py \
--tpu=local \
--experiment=bert/pretraining \
--mode=train_and_eval \
--model_dir=$MODEL_DIR/bert_large/training/cc-100 \
--config_file=$CONFIG_DIR/data/cc-100.yaml \
--config_file=$CONFIG_DIR/model/bert_large_character.yaml \
--params_override="task.train_data.input_path=$DATA_DIR/cc-100/pretraining_data_*.tfrecord,task.validation_data.input_path=$DATA_DIR/cc-100/pretraining_data_*.tfrecord,runtime.distribution_strategy=tpu"

# Continue training on Wikipedia
# It will take 33 days to finish on a v3-8 TPU
(vm)$ python3 official/nlp/train.py \
--tpu=local \
--experiment=bert/pretraining \
--mode=train_and_eval \
--model_dir=$MODEL_DIR/bert_large/training/cc-100_wikipedia \
--config_file=$CONFIG_DIR/data/wikipedia.yaml \
--config_file=$CONFIG_DIR/model/bert_large_character.yaml \
--params_override="task.init_checkpoint=$MODEL_DIR/bert_large/training/cc-100,task.train_data.input_path=$DATA_DIR/wikipedia/pretraining_data_*.tfrecord,task.validation_data.input_path=$DATA_DIR/wikipedia/pretraining_data_*.tfrecord,runtime.distribution_strategy=tpu"

Deleting a TPU VM

$ gcloud compute tpus tpu-vm delete <TPU_NODE_ID> --zone=<TPU_ZONE>

Model Conversion

You can convert the TensorFlow model checkpoint to a PyTorch model file.

Note: The model conversion script is designed for the models trained with TensorFlow v2.11.0. The script may not work for models trained with a different version of TensorFlow.

# For BERT-base, WordPiece
$ VOCAB_FILE=$WORK_DIR/tokenizers/wordpiece_unidic_lite/vocab.txt
$ TF_CONFIG_FILE=model_configs/bert_base_wordpiece/config.json
$ HF_CONFIG_DIR=hf_model_configs/bert_base_wordpiece
$ TF_CKPT_PATH=$WORK_DIR/model/wordpiece_unidic_lite/bert_base/training/cc-100_wikipedia/ckpt-1000000
$ OUTPUT_DIR=$WORK_DIR/hf_model/bert-base-japanese-v3

# For BERT-base, Character
$ VOCAB_FILE=$WORK_DIR/tokenizers/character_unidic_lite/vocab.txt
$ TF_CONFIG_FILE=model_configs/bert_base_character/config.json
$ HF_CONFIG_DIR=hf_model_configs/bert_base_character
$ TF_CKPT_PATH=$WORK_DIR/model/character_unidic_lite/bert_base/training/cc-100_wikipedia/ckpt-1000000
$ OUTPUT_DIR=$WORK_DIR/hf_model/bert-base-japanese-char-v3

# For BERT-large, WordPiece
$ VOCAB_FILE=$WORK_DIR/tokenizers/wordpiece_unidic_lite/vocab.txt
$ TF_CONFIG_FILE=model_configs/bert_large_wordpiece/config.json
$ HF_CONFIG_DIR=hf_model_configs/bert_large_wordpiece
$ TF_CKPT_PATH=$WORK_DIR/model/wordpiece_unidic_lite/bert_large/training/cc-100_wikipedia/ckpt-1000000
$ OUTPUT_DIR=$WORK_DIR/hf_model/bert-large-japanese-v2

# For BERT-large, Character
$ VOCAB_FILE=$WORK_DIR/tokenizers/character_unidic_lite/vocab.txt
$ TF_CONFIG_FILE=model_configs/bert_large_character/config.json
$ HF_CONFIG_DIR=hf_model_configs/bert_large_character
$ TF_CKPT_PATH=$WORK_DIR/model/character_unidic_lite/bert_large/training/cc-100_wikipedia/ckpt-1000000
$ OUTPUT_DIR=$WORK_DIR/hf_model/bert-large-japanese-char-v2

# Run the model conversion script
$ mkdir -p $OUTPUT_DIR
$ python convert_tf2_checkpoint_to_all_frameworks.py \
--tf_checkpoint_path $TF_CKPT_PATH \
--tf_config_file $TF_CONFIG_FILE \
--output_path $OUTPUT_DIR
$ cp $HF_CONFIG_DIR/* $OUTPUT_DIR
$ cp $VOCAB_FILE $OUTPUT_DIR

Model Performances

We evaluated our models' performances on the JGLUE benchmark tasks.

For each task, the model is fine-tuned on the training set and evaluated on the development set (the test sets are not publicly available as of this writing.) The hyperparameters were searched within the same set of values as the ones specified in the JGLUE fine-tuning README.

The results of our (informal) experiments are below. Note: These results should be viewed as informative only, since each setting was experimented with only one fixed random seed.

Model MARC-ja JSTS JNLI JSQuAD JCommonsenseQA
Acc. Pearson / Spearman Acc. EM / F1 Acc.
bert-base-japanese-v2 0.958 0.910 / 0.871 0.901 0.869 / 0.939 0.803
bert-base-japanese-v3 New! 0.962 0.919 / 0.881 0.907 0.880 / 0.946 0.848
bert-base-japanese-char-v2 0.957 0.891 / 0.851 0.896 0.870 / 0.938 0.724
bert-base-japanese-char-v3 New! 0.959 0.914 / 0.875 0.903 0.871 / 0.939 0.786
bert-large-japanese 0.958 0.913 / 0.874 0.902 0.881 / 0.946 0.823
bert-large-japanese-v2 New! 0.960 0.926 / 0.893 0.929 0.893 / 0.956 0.893
bert-large-japanese-char 0.958 0.883 / 0.842 0.899 0.870 / 0.938 0.753
bert-large-japanese-char-v2 New! 0.961 0.921 / 0.884 0.910 0.892 / 0.952 0.859

Licenses

The pretrained models and the codes in this repository are distributed under the Apache License 2.0.

Related Work

Acknowledgments

The distributed models are trained with Cloud TPUs provided by TPU Research Cloud program.

More Repositories

1

ILYS-aoba-chatbot

Python
22
star
2

keigo_transfer_task

ๆ•ฌ่ชžๅค‰ๆ›ใ‚ฟใ‚นใ‚ฏใซใŠใ‘ใ‚‹่ฉ•ไพก็”จใƒ‡ใƒผใ‚ฟใ‚ปใƒƒใƒˆ
20
star
3

JAQKET_baseline

Python
16
star
4

AIO2_DPR_baseline

https://www.nlp.ecei.tohoku.ac.jp/projects/aio/
Python
16
star
5

PheMT

A phenomenon-wise evaluation dataset for Japanese-English machine translation robustness. The dataset is based on the MTNT dataset, with additional annotations of four linguistic phenomena; Proper Noun, Abbreviated Noun, Colloquial Expression, and Variant. COLING 2020.
Python
13
star
6

phsic

Compute co-occurrence strength between sentences with short learning time
Python
8
star
7

eval-via-selection

Python
7
star
8

showcase

A PyTorch implementation of the Japanese Predicate-Argument Structure (PAS) analyser presented in the paper of Matsubayashi & Inui (2018) with some improvements.
Python
6
star
9

elmo-japanese

Python
5
star
10

BPersona-chat

waiting for update
5
star
11

Multi-VidSum

Python
5
star
12

xbe

Cross-stitch bi-encoder for distantly supervised relation extraction (EMNLP 2022)
Python
4
star
13

TYPIC

4
star
14

J-UniMorph

Dataset of UniMorph in Japanese
JavaScript
4
star
15

JAOJ

Japanese Argument Omission Judgement
3
star
16

AIO3_FiD_baseline

A baseline QA system for AIO3 competition utilizing Dense Passage Retrieval (DPR) and Fusion-in-Decoder (FiD)
Python
3
star
17

AIO3_BPR_baseline

A baseline QA system for AIO3 competition utilizing Binary Passage Retriever (BPR)
Python
3
star
18

CORD-19_phrase_embeddings

Phrase embeddings trained on the CORD-19 collection of COVID-19 research papers
Python
3
star
19

aio4-bpr-baseline

Python
2
star
20

AIO4_GPT_baseline

Python
2
star
21

aio4-fid-baseline

็ฌฌ4ๅ›žAI็Ž‹ ๆ—ฉๆŠผใ—่งฃ็ญ”้ƒจ้–€ Fusion-in-Decoder(FiD)ใƒ™ใƒผใ‚นใƒฉใ‚คใƒณ
Python
2
star
22

LPAttack

LPAttack: A Feasible Annotation Scheme for Capturing Logic Pattern of Attacks in Arguments
2
star
23

trace-manipulate

Python
1
star
24

scientific-paper-pros-cons

1
star
25

IRAC_2022

A knowledge base of domain-specific implicit reasonings in arguments via causality
1
star
26

quiz-datasets

QA datasets created from Japanese quiz questions.
Python
1
star
27

hagi-bot

ASP.NET
1
star