rasa-nlu-benchmark
A collection of datasets and corresponding benchmarks for Rasa NLU
Introduction
Rasa NLU is a powerful open-source natural language processing tool for intent classification and entity extraction in chatbots.
However, we could not find any published public datasets with corresponding benchmarks for it. This makes it difficult to evaluate the performance of an NLU system built with Rasa.
This project therefore collects and organizes datasets and baselines for task-oriented dialogue. All datasets are provided in the data format required by Rasa NLU, so you can use them directly in your own Rasa NLU system.
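As an illustration of what "the data format required by Rasa NLU" looks like, here is a minimal sketch that builds one training example in the Rasa NLU JSON layout (`rasa_nlu_data` / `common_examples`, with character-offset entities). The helper names `to_rasa_example` and `to_rasa_nlu_data`, the sample sentence, and the labels are hypothetical and only for illustration; the files in this repository may be stored differently (for example as Markdown).

```python
import json


def to_rasa_example(text, intent, entities=()):
    """Build one Rasa NLU training example (hypothetical helper).

    `entities` is an iterable of (value, entity_type) pairs. Each value must
    occur in `text`; its first occurrence is used to compute the offsets.
    """
    spans = []
    for value, entity_type in entities:
        start = text.find(value)
        if start == -1:
            raise ValueError(f"entity value {value!r} not found in {text!r}")
        spans.append({
            "start": start,
            "end": start + len(value),  # end offset is exclusive
            "value": value,
            "entity": entity_type,
        })
    return {"text": text, "intent": intent, "entities": spans}


def to_rasa_nlu_data(examples):
    """Wrap examples in the top-level structure of the Rasa NLU JSON format."""
    return {"rasa_nlu_data": {"common_examples": list(examples)}}


# Illustrative sentence and labels, not taken from any dataset file.
example = to_rasa_example(
    "show me flights from boston to denver",
    "flight",
    [("boston", "fromloc.city_name"), ("denver", "toloc.city_name")],
)
data = to_rasa_nlu_data([example])
print(json.dumps(data, indent=2, ensure_ascii=False))
```

Storing character offsets alongside the entity value lets a consumer check that `text[start:end]` matches `value`, which is a cheap sanity check when converting a dataset.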
Datasets
All datasets are organized and archived in the `data` directory.
The following information is included for each dataset:
- Name
- Language
- Task
- Size (train/test)
- Number of intents/entities
- Link (Website Or Paper)
Name | Language | Task | Size(Train/Test) | Intent/Entity Nums | Link |
---|---|---|---|---|---|
ATIS | en | Airline Travel Information | 4978/893 | 26/129 | more detail |
Snips | en | 7 intents, including: AddToPlaylist, BookRestaurant... | 13802/699 | 7/72 | more detail |
AskUbuntuCorpus | en | 5 intents, questions about Ubuntu | 127/35 | 5/3 | more detail |
Facebook Multilingual Task Oriented Dataset | en | 3 domains, including: alarm, weather, reminder | 30521/8621 | 12/25 | more detail |
SMP2019 | zh | 29 domains, including: app, email... | 2063/480 | 24/62 | more detail |
Check flow dataset | zh | 13 intents, some request and inform | 809/210 | 13/6 | more detail |
MSRA_NER | zh | 1 intent, including various kinds of news and 3 kinds of entities | 20864/4636 | 1/3 | more detail |
ToutiaoNews | zh | 7 intents, including 7 kinds of news | 325279/57409 | 7/0 | more detail |
Note:
- The SMP2019 and CheckFlow datasets come without an official train/test split, so we split them 8:2 ourselves.
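An 8:2 split like the one mentioned in the note can be sketched as follows. This is not the script used for this repository: the per-intent (stratified) grouping, the fixed seed, and the function name `split_by_intent` are assumptions made for the sake of a reproducible example.

```python
import random
from collections import defaultdict


def split_by_intent(examples, train_ratio=0.8, seed=42):
    """Split examples into train/test sets, intent by intent.

    Splitting within each intent (an assumption, not stated in the note)
    keeps every intent represented in both sets when it has enough examples.
    """
    by_intent = defaultdict(list)
    for ex in examples:
        by_intent[ex["intent"]].append(ex)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for intent in sorted(by_intent):
        group = by_intent[intent]
        rng.shuffle(group)
        cut = int(len(group) * train_ratio)
        train.extend(group[:cut])
        test.extend(group[cut:])
    return train, test


# Toy data: 20 utterances, 10 per intent.
examples = [
    {"text": f"utterance {i}", "intent": "request" if i % 2 else "inform"}
    for i in range(20)
]
train, test = split_by_intent(examples)
print(len(train), len(test))
```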
Benchmark
Baseline Pipeline
- For English datasets, we use the official `pretrained_embeddings_spacy` and `supervised_embeddings` pipelines as the baseline NLU pipelines.
- For Chinese datasets, we use the officially recommended Chinese pipeline, `rasa_nlu_chi`, as the baseline NLU pipeline.
Result
Dataset | NLU Pipeline | Intent auc | Intent p | Intent r | Intent f1 | Entity auc | Entity p | Entity r | Entity f1 |
---|---|---|---|---|---|---|---|---|---|
ATIS(en) | pretrained_embeddings_spacy | 0.91 | 0.91 | 0.91 | 0.91 | 0.98 | 0.98 | 0.98 | 0.98 |
ATIS(en) | supervised_embeddings | 1.00 | 1.00 | 1.00 | 1.00 | 0.98 | 0.98 | 0.98 | 0.98 |
Snips(en) | pretrained_embeddings_spacy | 0.99 | 0.99 | 0.99 | 0.99 | 1.00 | 1.00 | 1.00 | 1.00 |
Snips(en) | supervised_embeddings | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
AskUbuntuCorpus(en) | pretrained_embeddings_spacy | 0.89 | 0.89 | 0.89 | 0.89 | 0.95 | 0.95 | 0.95 | 0.95 |
AskUbuntuCorpus(en) | supervised_embeddings | 0.86 | 0.86 | 0.86 | 0.86 | 0.95 | 0.95 | 0.95 | 0.95 |
Facebook Multilingual Task Oriented Dataset(en) | pretrained_embeddings_spacy | 0.96 | 0.96 | 0.96 | 0.96 | 0.98 | 0.98 | 0.98 | 0.98 |
Facebook Multilingual Task Oriented Dataset(en) | supervised_embeddings | 0.99 | 0.99 | 0.99 | 0.99 | 0.98 | 0.98 | 0.98 | 0.98 |
SMP2019(zh) | rasa_nlu_chi | 0.76 | 0.83 | 0.76 | 0.78 | 0.79 | 0.80 | 0.79 | 0.77 |
CheckFlow(zh) | rasa_nlu_chi | 0.95 | 0.95 | 0.95 | 0.94 | 1.00 | 1.00 | 1.00 | 1.00 |
MSRA_NER(zh) | rasa_nlu_chi | N/A | N/A | N/A | N/A | 0.98 | 0.98 | 0.98 | 0.98 |
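For readers who want to reproduce numbers of the kind shown in the p, r and f1 columns, the sketch below computes precision, recall and F1 from gold and predicted labels. The table does not state which averaging scheme was used or how the first column was obtained, so the support-weighted averaging here is an assumption, and `weighted_scores` is a hypothetical helper, not part of Rasa. Plain accuracy is returned as well, since it always equals support-weighted recall.

```python
from collections import Counter


def weighted_scores(y_true, y_pred):
    """Return (accuracy, precision, recall, f1).

    Precision, recall and F1 are averaged over labels, weighted by how often
    each label occurs in y_true (assumed scheme, not confirmed by the table).
    """
    total = len(y_true)
    support = Counter(y_true)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / total

    precision = recall = f1 = 0.0
    for label, n in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)
        p_l = tp / predicted if predicted else 0.0
        r_l = tp / n
        f_l = 2 * p_l * r_l / (p_l + r_l) if (p_l + r_l) else 0.0
        weight = n / total
        precision += weight * p_l
        recall += weight * r_l
        f1 += weight * f_l
    return accuracy, precision, recall, f1


# Toy labels, not taken from any dataset in this repository.
y_true = ["a", "a", "a", "b"]
y_pred = ["a", "a", "b", "b"]
acc, p, r, f = weighted_scores(y_true, y_pred)
print(f"acc={acc:.2f} p={p:.2f} r={r:.2f} f1={f:.2f}")
```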
We further use Rasa's official Comparing NLU Pipelines tool to compare `pretrained_embeddings_spacy` and `supervised_embeddings` on the AskUbuntuCorpus (small) and Snips (large) datasets.
The comparison shows that `pretrained_embeddings_spacy` performs better when the training data is relatively small, while `supervised_embeddings` performs better when enough training data is available.