JGLUE: Japanese General Language Understanding Evaluation


JGLUE, the Japanese General Language Understanding Evaluation benchmark, is built to measure general natural language understanding (NLU) ability in Japanese. JGLUE has been constructed from scratch without translation. We hope that JGLUE will facilitate NLU research in Japanese.

JGLUE has been constructed by a joint research project of Yahoo Japan Corporation and Kawahara Lab at Waseda University.

Tasks/Datasets

JGLUE consists of three kinds of tasks: text classification, sentence pair classification, and question answering (QA). Each task consists of multiple datasets, which can be found under the datasets directory. Only the train/dev sets are available now; the test sets will be released when the leaderboard is made public. We use Yahoo! Crowdsourcing for all crowdsourcing tasks in constructing the datasets.

| Task | Dataset | Train | Dev | Test |
|---|---|---|---|---|
| Text Classification | MARC-ja | 187,528 | 5,654 | 5,639 |
| | JCoLA† | - | - | - |
| Sentence Pair Classification | JSTS | 12,451 | 1,457 | 1,589 |
| | JNLI | 20,073 | 2,434 | 2,508 |
| QA | JSQuAD | 62,859 | 4,442 | 4,420 |
| | JCommonsenseQA | 8,939 | 1,119 | 1,118 |

†JCoLA will be added soon.

Dataset Description

MARC-ja

MARC-ja is a text classification dataset based on the Japanese portion of the Multilingual Amazon Reviews Corpus (MARC) (Keung+, 2020).

We performed the following modifications to the original dataset:

  1. To make the class labels easy for both humans and computers to judge, we cast the task as binary classification: 1- and 2-star ratings are converted to negative, and 4- and 5-star ratings to positive. Reviews with a 3-star rating are not used (a minimal conversion sketch follows this list).
  2. In some instances the rating diverges from the review text. To improve the quality of the dev/test instances, we crowdsource a positive/negative judgment task, adopt only the reviews on which seven or more out of 10 workers agree, and assign the majority label to those reviews.
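
The rating-to-label conversion in step 1 can be summarized with the following minimal Python sketch. It is not the project's conversion script (that is preprocess/marc-ja/scripts/marc-ja.py, used below); the column names star_rating and review_body are assumptions based on the MARC TSV format.

    # Minimal sketch of the rating-to-label conversion described in step 1.
    # NOT the official marc-ja.py script; the TSV column names are assumptions.
    import csv
    import sys

    def star_to_label(star_rating: int):
        """Map 1/2-star ratings to 'negative', 4/5-star to 'positive'; drop 3-star."""
        if star_rating in (1, 2):
            return "negative"
        if star_rating in (4, 5):
            return "positive"
        return None  # 3-star reviews are not used

    reader = csv.DictReader(sys.stdin, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        label = star_to_label(int(row["star_rating"]))
        if label is not None:
            print(f"{label}\t{row['review_body']}")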

We don't distribute the dataset itself. Please download the original dataset, and run a conversion script as follows:

  1. Download https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_multilingual_JP_v1_00.tsv.gz
  2. Run the following commands:
$ pip install -r preprocess/requirements.txt
$ cd preprocess/marc-ja/scripts
$ gzip -dc /somewhere/amazon_reviews_multilingual_JP_v1_00.tsv.gz | \
  python marc-ja.py \
         --positive-negative \
         --output-dir ../../../datasets/marc_ja-v1.1 \
         --max-char-length 500 \
         --filter-review-id-list-valid ../data/filter_review_id_list/valid.txt \
         --label-conv-review-id-list-valid ../data/label_conv_review_id_list/valid.txt

The train and valid sets will be generated under the datasets/marc_ja-v1.1 directory.

When you use this dataset, please follow the license of Multilingual Amazon Reviews Corpus (MARC).

JSTS

JSTS is a Japanese version of the STS (Semantic Textual Similarity) dataset. STS is a task to estimate the semantic similarity of a sentence pair. The sentences in JSTS and JNLI (described below) are extracted from the Japanese version of the MS COCO Caption Dataset, the YJ Captions Dataset (Miyazaki and Shimizu, 2016).

{"sentence_pair_id": "691",
 "yjcaptions_id": "127202-129817-129818",
 "sentence1": "街中の道路を大きなバスが走っています。 (A big bus is running on the road in the city.)", 
 "sentence2": "道路を大きなバスが走っています。 (There is a big bus running on the road.)", 
 "label": 4.4}

(Note that English translations are added in this example for those who do not understand Japanese, and are not included in the dataset.)

| Name | Description |
|---|---|
| sentence_pair_id | id |
| yjcaptions_id | sentence ids in yjcaptions (explained below) |
| sentence1 | first sentence |
| sentence2 | second sentence |
| label | sentence similarity, from 5 (equivalent meaning) to 0 (completely different meaning) |
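
As a usage illustration, the following minimal sketch reads JSTS examples, assuming the split is stored as JSON Lines (one object per line, as in the example above); the file path is an assumption and should be adjusted to wherever the dataset is placed.

    # Minimal sketch: read JSTS examples stored as JSON Lines (one object per line).
    # The path below is an assumption; adjust it to the actual dataset location.
    import json

    def load_jsonl(path):
        with open(path, encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]

    examples = load_jsonl("datasets/jsts-v1.1/train-v1.1.json")  # assumed path
    first = examples[0]
    print(first["sentence1"], first["sentence2"], first["label"])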

Explanation for yjcaptions_id

The yjcaptions_id field takes one of the following two forms:

  1. sentence pairs in one image: (image id)-(sentence1 id)-(sentence2 id)
    • e.g., 723-844-847
    • a sentence id starting with "g" denotes a sentence generated by a crowdworker (e.g., 69501-75698-g103); this occurs only in JNLI
  2. sentence pairs in two images: (image id of sentence1)_(image id of sentence2)-(sentence1 id)-(sentence2 id)
    • e.g., 91337_217583-96105-91680
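
The two id formats above can be taken apart with a small helper like the following; this is purely illustrative and not part of the JGLUE distribution.

    # Hypothetical helper that splits a yjcaptions_id into its parts,
    # following the two formats described above. For illustration only.
    def parse_yjcaptions_id(yjcaptions_id: str) -> dict:
        image_part, sent1_id, sent2_id = yjcaptions_id.split("-")
        # Two image ids joined by "_" mean the sentences come from different images.
        image_ids = image_part.split("_")
        # A sentence id starting with "g" is a crowdworker-generated sentence (JNLI only).
        generated = [s for s in (sent1_id, sent2_id) if s.startswith("g")]
        return {"image_ids": image_ids,
                "sentence_ids": [sent1_id, sent2_id],
                "crowdworker_generated": generated}

    print(parse_yjcaptions_id("723-844-847"))
    print(parse_yjcaptions_id("91337_217583-96105-91680"))
    print(parse_yjcaptions_id("69501-75698-g103"))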

JNLI

JNLI is a Japanese version of the NLI (Natural Language Inference) dataset. NLI is a task to recognize the inference relation that a premise sentence has to a hypothesis sentence. The inference relations are entailment, contradiction, and neutral.

{"sentence_pair_id": "1157",
 "yjcaptions_id": "127202-129817-129818",
 "sentence1": "街中の道路を大きなバスが走っています。 (A big bus is running on the road in the city.)", 
 "sentence2": "道路を大きなバスが走っています。 (There is a big bus running on the road.)", 
 "label": "entailment"}

| Name | Description |
|---|---|
| sentence_pair_id | id |
| yjcaptions_id | sentence ids in yjcaptions |
| sentence1 | premise sentence |
| sentence2 | hypothesis sentence |
| label | inference relation |
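
For classification fine-tuning, the three string labels need to be mapped to integer ids; the particular id assignment below is an arbitrary illustration, not something fixed by JGLUE.

    # Map JNLI's string labels to integer ids for fine-tuning.
    # The id assignment here is arbitrary and only illustrative.
    LABEL2ID = {"entailment": 0, "contradiction": 1, "neutral": 2}

    def encode_label(example: dict) -> dict:
        example["label_id"] = LABEL2ID[example["label"]]
        return example

    print(encode_label({"sentence_pair_id": "1157", "label": "entailment"}))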

JSQuAD

JSQuAD is a Japanese version of SQuAD (Rajpurkar+, 2016), a reading comprehension dataset. Each instance consists of a question about a given context (a Wikipedia article) and its answer. JSQuAD is based on SQuAD 1.1, so there are no unanswerable questions. We used the Japanese Wikipedia dump as of 20211101.

The JSON format is the same as that of the original SQuAD.

    {
      "title": "東海道新幹線 (Tokaido Shinkansen)",
      "paragraphs": [
        {
          "qas": [
            {
              "question": "2020年(令和2年)3月現在、東京駅 - 新大阪駅間の最高速度はどのくらいか。 (What is the maximum speed between Tokyo Station and Shin-Osaka Station as of March 2020?)",
              "id": "a1531320p0q0",
              "answers": [
                {
                  "text": "285 km/h",
                  "answer_start": 182
                }
              ],
              "is_impossible": false
            },
            {
             .. 
            }
          ],
          "context": "東海道新幹線 [SEP] 1987年(昭和62年)4月1日の国鉄分割民営化により、JR東海が運営を継承した。西日本旅客鉄道(JR西日本)が継承した山陽新幹線とは相互乗り入れが行われており、東海道新幹線区間のみで運転される列車にもJR西日本所有の車両が使用されることがある。2020年(令和2年)3月現在、東京駅 - 新大阪駅間の所要時間は最速2時間21分、最高速度285 km/hで運行されている。"
        }
      ]
    }

| Name | Description |
|---|---|
| title | title of a Wikipedia article |
| paragraphs | a set of paragraphs |
| qas | a set of pairs of a question and its answer |
| question | question |
| id | id of a question |
| answers | a set of answers |
| text | answer text |
| answer_start | start position (character index) |
| is_impossible | all the values are false |
| context | a concatenation of the title and paragraph |
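
Because the layout follows SQuAD 1.1, the file can be traversed with the standard nested loop; the file path and the top-level "data" key in the sketch below are assumptions carried over from the original SQuAD format.

    # Minimal sketch: iterate over JSQuAD in the SQuAD 1.1 layout shown above.
    # The file path and the top-level "data" key are assumptions.
    import json

    with open("datasets/jsquad-v1.1/train-v1.1.json", encoding="utf-8") as f:
        data = json.load(f)["data"]

    for article in data:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                answer = qa["answers"][0]
                # answer_start is a character index into the context string.
                start = answer["answer_start"]
                span = context[start:start + len(answer["text"])]
                print(qa["id"], qa["question"], answer["text"], span == answer["text"])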

JCommonsenseQA

JCommonsenseQA is a Japanese version of CommonsenseQA (Talmor+, 2019), which is a multiple-choice question answering dataset that requires commonsense reasoning ability. It is built using crowdsourcing with seeds extracted from the knowledge base ConceptNet.

{"q_id": 3016,
 "question": "会社の最高責任者を何というか? (What do you call the chief executive officer of a company?)",
 "choice0": "社長 (president)",
 "choice1": "教師 (teacher)",
 "choice2": "部長 (manager)",
 "choice3": "バイト (part-time worker)",
 "choice4": "部下 (subordinate)",
 "label": 0}

| Name | Description |
|---|---|
| q_id | id |
| question | question |
| choice{0..4} | choice |
| label | correct choice id |
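
A multiple-choice model typically scores each (question, choice) pair; the sketch below shows that reshaping for the example format above, purely as an illustration.

    # Minimal sketch: turn a JCommonsenseQA example into (question, choice) pairs
    # for a multiple-choice model. For illustration only.
    def to_choice_pairs(example: dict):
        choices = [example[f"choice{i}"] for i in range(5)]
        pairs = [(example["question"], choice) for choice in choices]
        return pairs, example["label"]  # label is the index of the correct choice

    example = {"q_id": 3016,
               "question": "会社の最高責任者を何というか?",
               "choice0": "社長", "choice1": "教師", "choice2": "部長",
               "choice3": "バイト", "choice4": "部下",
               "label": 0}
    pairs, gold = to_choice_pairs(example)
    print(pairs[gold])  # -> ('会社の最高責任者を何というか?', '社長')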

Baseline Scores

The following foundation models are used for the evaluation.

| Model | Basic Unit | Pretraining Texts |
|---|---|---|
| Tohoku BERT base | subword (MeCab + BPE) | Japanese Wikipedia |
| Tohoku BERT base (char) | character | Japanese Wikipedia |
| NICT BERT base | subword (MeCab + BPE) | Japanese Wikipedia |
| Waseda RoBERTa base | subword (Juman++ + Unigram LM) | Japanese Wikipedia + CC |
| XLM RoBERTa base | subword (Unigram LM) | multilingual CC |

Note that large-sized models corresponding to Tohoku BERT base, Waseda RoBERTa base, and XLM RoBERTa base are also used. For Waseda RoBERTa large, two versions with different maximum sequence lengths are used: Waseda RoBERTa large (s128) and Waseda RoBERTa large (s512).

When you use the NICT BERT base or Waseda RoBERTa base models, the dataset text should be segmented into words in advance with the corresponding morphological analyzer:

  • NICT BERT base: MeCab (0.996) with JUMAN dictionary
  • Waseda RoBERTa base: Juman++ (2.0.0-rc3)

Please refer to preprocess/morphological-analysis/README.md.
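
For illustration, a word segmentation step for the NICT BERT case might look like the sketch below, using MeCab through the mecab-python3 bindings; this is not the project's preprocess/morphological-analysis pipeline, and the JUMAN dictionary path is an assumption that depends on how the dictionary is installed.

    # Hedged sketch of pre-segmenting text with MeCab (mecab-python3) and the
    # JUMAN dictionary, as required for NICT BERT. Not the official pipeline;
    # the dictionary path below is an assumption.
    import MeCab

    tagger = MeCab.Tagger("-Owakati -d /usr/lib/mecab/dic/jumandic")  # assumed dic path

    def segment(text: str) -> str:
        """Return the input text with words separated by single spaces."""
        return tagger.parse(text).strip()

    print(segment("街中の道路を大きなバスが走っています。"))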

The fine-tuning was performed using the transformers library provided by Hugging Face. See fine-tuning/README.md for details.
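
As a rough illustration of that setup (not the actual scripts under fine-tuning/), a minimal sequence classification run for MARC-ja could look like the following; the model id cl-tohoku/bert-base-japanese (Tohoku BERT base, which also requires fugashi and its dictionary), the toy examples, and the hyperparameters are all assumptions for illustration.

    # Minimal, hedged fine-tuning sketch with Hugging Face transformers.
    # Model id, toy data, and hyperparameters are illustrative assumptions;
    # see fine-tuning/README.md for the actual configuration.
    import torch
    from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                              Trainer, TrainingArguments)

    model_name = "cl-tohoku/bert-base-japanese"  # assumed id for Tohoku BERT base
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

    # Toy MARC-ja-style examples (1 = positive, 0 = negative); in practice these
    # would come from the generated datasets/marc_ja-v1.1 files.
    texts = ["とても良い商品でした。", "すぐに壊れてしまいました。"]
    labels = [1, 0]
    encodings = tokenizer(texts, truncation=True, padding=True, max_length=512)

    class ToyDataset(torch.utils.data.Dataset):
        def __init__(self, encodings, labels):
            self.encodings, self.labels = encodings, labels
        def __len__(self):
            return len(self.labels)
        def __getitem__(self, i):
            item = {k: torch.tensor(v[i]) for k, v in self.encodings.items()}
            item["labels"] = torch.tensor(self.labels[i])
            return item

    args = TrainingArguments(output_dir="marc_ja_out", num_train_epochs=1,
                             per_device_train_batch_size=2)
    Trainer(model=model, args=args, train_dataset=ToyDataset(encodings, labels)).train()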

The performance along with human scores on the JGLUE dev set is shown below.

| Model | MARC-ja (acc) | JSTS (Pearson/Spearman) | JNLI (acc) | JSQuAD (EM/F1) | JCommonsenseQA (acc) |
|---|---|---|---|---|---|
| Human | 0.989 | 0.899/0.861 | 0.925 | 0.871/0.944 | 0.986 |
| Tohoku BERT base | 0.958 | 0.909/0.868 | 0.899 | 0.871/0.941 | 0.808 |
| Tohoku BERT base (char) | 0.956 | 0.893/0.851 | 0.892 | 0.864/0.937 | 0.718 |
| Tohoku BERT large | 0.955 | 0.913/0.872 | 0.900 | 0.880/0.946 | 0.816 |
| NICT BERT base | 0.958 | 0.910/0.871 | 0.902 | 0.897/0.947 | 0.823 |
| Waseda RoBERTa base | 0.962 | 0.913/0.873 | 0.895 | 0.864/0.927 | 0.840 |
| Waseda RoBERTa large (s128) | 0.954 | 0.930/0.896 | 0.924 | 0.884/0.940 | 0.907 |
| Waseda RoBERTa large (s512) | 0.961 | 0.926/0.892 | 0.926 | 0.918/0.963 | 0.891 |
| XLM RoBERTa base | 0.961 | 0.877/0.831 | 0.893 | -/-† | 0.687 |
| XLM RoBERTa large | 0.964 | 0.918/0.884 | 0.919 | -/-† | 0.840 |

†The XLM RoBERTa base/large models use a unigram language model as the tokenizer and are excluded from the JSQuAD evaluation because the token boundaries often do not match the start/end of the answer span, resulting in poor performance.

Leaderboard

A leaderboard will be made public soon. The test set will be released at that time.

Reference

@inproceedings{kurihara-etal-2022-jglue,
    title = "{JGLUE}: {J}apanese General Language Understanding Evaluation",
    author = "Kurihara, Kentaro  and
      Kawahara, Daisuke  and
      Shibata, Tomohide",
    booktitle = "Proceedings of the Thirteenth Language Resources and Evaluation Conference",
    month = jun,
    year = "2022",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://aclanthology.org/2022.lrec-1.317",
    pages = "2957--2966",
    abstract = "To develop high-performance natural language understanding (NLU) models, it is necessary to have a benchmark to evaluate and analyze NLU ability from various perspectives. While the English NLU benchmark, GLUE, has been the forerunner, benchmarks are now being released for languages other than English, such as CLUE for Chinese and FLUE for French; but there is no such benchmark for Japanese. We build a Japanese NLU benchmark, JGLUE, from scratch without translation to measure the general NLU ability in Japanese. We hope that JGLUE will facilitate NLU research in Japanese.",
}

@InProceedings{Kurihara_nlp2022,
  author = 	"栗原健太郎 and 河原大輔 and 柴田知秀",
  title = 	"JGLUE: 日本語言語理解ベンチマーク",
  booktitle = 	"言語処理学会第28回年次大会",
  year =	"2022",
  url = "https://www.anlp.jp/proceedings/annual_meeting/2022/pdf_dir/E8-4.pdf"
  note= "in Japanese"
}

License

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.


Contributor License Agreement

This project requires contributors to accept the terms in the Contributor License Agreement (CLA).

Please note that contributors to the JGLUE repository on GitHub (https://github.com/yahoojapan/JGLUE) shall be deemed to have accepted the CLA without individual written agreements.
