

Meta Dialog Platform (MDP)

Meta Dialog Platform: a toolkit for NLP few-shot learning tasks:

  • Text Classification
  • Sequence Labeling

It also provides baselines for these tasks.

Features

State-of-the-art solutions for Few-shot NLP:

Easy-to-start & flexible framework:

  • Provide tools for easy training & testing.
  • Support various few-shot models with unified and extendable interfaces, such as ProtoNet and TapNet.
  • Support easy-to-switch similarity metrics and logits-scaling methods.
  • Provide tools for generating episode-style data for meta-learning.
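To make the prototype-based models listed above concrete, here is a minimal ProtoNet-style sketch (pure Python, not the platform's actual implementation): support embeddings are averaged per class into prototypes, and a query is assigned to the nearest prototype, with negative euclidean distance playing the role of the logit.

```python
import math

def prototypes(support_embs, support_labels):
    """Average the support embeddings of each class into one prototype vector."""
    by_class = {}
    for emb, label in zip(support_embs, support_labels):
        by_class.setdefault(label, []).append(emb)
    return {
        label: [sum(dims) / len(embs) for dims in zip(*embs)]
        for label, embs in by_class.items()
    }

def classify(query_emb, protos):
    """Assign the query to the nearest prototype (negative distance as the logit)."""
    def neg_dist(proto):
        return -math.sqrt(sum((q - c) ** 2 for q, c in zip(query_emb, proto)))
    return max(protos, key=lambda label: neg_dist(protos[label]))

# Toy 2-way 2-shot episode with hand-made 2-d "embeddings".
support = [[0.0, 0.0], [0.2, 0.0], [1.0, 1.0], [1.2, 1.0]]
labels = ["statement", "statement", "query", "query"]
protos = prototypes(support, labels)
print(classify([0.1, 0.1], protos))  # nearest to the "statement" prototype
```

Swapping the distance function (e.g. for cosine similarity) is exactly the kind of "easy-to-switch similarity metric" the framework exposes behind its interfaces.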

Citation

If you use the code or data, please cite:

@article{hou2020fewjoint,
	title={FewJoint: A Few-shot Learning Benchmark for Joint Language Understanding},
	author={Hou, Yutai and Mao, Jiafeng and Lai, Yongkui and Chen, Cheng and Che, Wanxiang and Chen, Zhigang and Liu, Ting},
	journal={arXiv preprint},
	year={2020}
}

Get Started

Environment Requirement

python>=3.6
torch>=1.2.0
transformers>=2.9.0
numpy>=1.17.0
tqdm>=4.31.1
allennlp>=0.8.4
pytorch-nlp
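The constraints above are all `>=` bounds, which compare component-wise on dotted version strings. A small stand-alone sketch of that comparison (a hypothetical helper, not part of the toolkit):

```python
def version_tuple(v):
    """Convert '1.2.0' into a comparable tuple of ints, ignoring non-numeric suffixes."""
    parts = []
    for piece in v.split("."):
        digits = "".join(ch for ch in piece if ch.isdigit())
        parts.append(int(digits) if digits else 0)
    return tuple(parts)

def meets_requirement(installed, minimum):
    """True if the installed version satisfies a '>=' constraint."""
    return version_tuple(installed) >= version_tuple(minimum)

print(meets_requirement("1.13.1", "1.2.0"))  # torch 1.13.1 satisfies torch>=1.2.0
```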

Example for Sequence Labeling

Here, we take the few-shot slot tagging and NER tasks from (Hou et al., 2020) as quick-start examples.

Step1: Prepare pre-trained embedding

  • Download the PyTorch BERT model, or convert the TensorFlow parameters yourself with conversion scripts.
  • Set BERT path in the ./scripts/run_1_shot_slot_tagging.sh to your setting:
bert_base_uncased=/your_dir/uncased_L-12_H-768_A-12/
bert_base_uncased_vocab=/your_dir/uncased_L-12_H-768_A-12/vocab.txt

Step2: Prepare data

  • Download the compatible few-shot data here: download

  • Set test, train, dev data file path in ./scripts/run_1_shot_slot_tagging.sh to your setting.

For simplicity, you only need to set the root path for the data, as follows:

base_data_dir=/your_dir/ACL2020data/

Step3: Train and test the main model

  • Build a folder to collect running log
mkdir result
  • Execute the cross-evaluation script with two arguments: [gpu id] and [dataset name].
Example for 1-shot slot tagging:
source ./scripts/run_1_shot_slot_tagging.sh 0 snips
Example for 1-shot NER:
source ./scripts/run_1_shot_slot_tagging.sh 0 ner

To run 5-shot experiments, use ./scripts/run_5_shot_slot_tagging.sh

Other detailed functions and options:

You can experiment freely by passing parameters to main.py to choose different model architectures, hyperparameters, etc.

To view detailed options and their descriptions, run:

python main.py --help

We provide scripts for the general few-shot classification and sequence labeling tasks, respectively:

  • classification
    • run_electra_sc.sh
    • run_bert_sc.sh
  • sequence labeling
    • run_electra_sl.sh
    • run_bert_sl.sh

The usage of these scripts is similar to the process described in Get Started.

Run with FewJoint/SMP data

  • Get the reformatted FewJoint data here, or construct episode-style data yourself with our tool.
  • Use script ./scripts/run_smp_bert_sc.sh and ./scripts/run_smp_bert_sl.sh to perform few-shot intent detection or few-shot slot filling respectively.
  • Notice that:
    1. Change train/dev/test path in the scripts before running.
    2. Find predicted results at trained_model_path within running scripts.

Few-shot Data Construction Tool

We also provide a generation tool for converting normal data into few-shot/meta-episode style. The tool is located at scripts/other_tool/meta_dataset_generator.py.

Run the following command to view the detailed interface:

python generate_meta_dataset.py --help

For simplicity, we provide an example script to help generate few-shot data: ./scripts/gen_meta_data.sh.

The following key parameters control the generation process:

  • input_dir: raw data path
  • output_dir: output data path
  • episode_num: the number of episodes to generate
  • support_shots_lst: the support-set shot sizes for each episode; multiple values can be given to generate several settings at once
  • query_shot: the query-set shot size for each episode
  • seed_lst: a list of random seeds to control generation
  • use_fix_support: use a fixed support set in the dev dataset
  • dataset_lst: the dataset type to handle; choices are stanford, SLU, TourSG, and SMP
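Putting these parameters together, a run might look like the following sketch (paths, values, and flag syntax are illustrative, inferred from the parameter names above; ./scripts/gen_meta_data.sh is the authoritative example):

```shell
python generate_meta_dataset.py \
    --input_dir /your_dir/raw_data/ \
    --output_dir /your_dir/meta_data/ \
    --dataset_lst smp \
    --episode_num 100 \
    --support_shots_lst 1 5 \
    --query_shot 16 \
    --seed_lst 0 1 2
```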

If you want to handle other types of datasets, add your own loading code in meta_dataset_generator/raw_data_loader.py.
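The exact loader interface is defined in meta_dataset_generator/raw_data_loader.py; as a rough illustration of what such a loader produces (a hypothetical CoNLL-style reader, not the tool's actual API), here is a function that turns `token<TAB>tag` lines into the parallel seq_ins/seq_outs sequences used by the episode format below:

```python
def load_conll_style(lines):
    """Parse 'token<TAB>tag' lines into parallel token and tag sequences;
    blank lines separate sentences. A hypothetical loader sketch, shown only
    to illustrate the seq_ins/seq_outs target format."""
    seq_ins, seq_outs = [], []
    tokens, tags = [], []
    for line in list(lines) + [""]:   # sentinel blank line flushes the last sentence
        line = line.strip()
        if not line:
            if tokens:
                seq_ins.append(tokens)
                seq_outs.append(tags)
                tokens, tags = [], []
            continue
        token, tag = line.split("\t")
        tokens.append(token)
        tags.append(tag)
    return seq_ins, seq_outs

raw = ["we\tO", "are\tO", "friends\tO", "", "how\tO", "are\tO", "you\tO"]
seq_ins, seq_outs = load_conll_style(raw)
print(seq_ins)  # [['we', 'are', 'friends'], ['how', 'are', 'you']]
```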

few-shot/meta-episode style data example
{
  "domain_name": [
    {  // episode
      "support": {  // support set
        "seq_ins": [["we", "are", "friends", "."], ["how", "are", "you", "?"]],  // input sequence
        "seq_outs": [["O", "O", "O", "O"], ["O", "O", "O", "O"]],  // output sequence in sequence labeling task
        "labels": [["statement"], ["query"]]  // output labels in classification task
      },
      "query": {  // query set
        "seq_ins": [["we", "are", "friends", "."], ["how", "are", "you", "?"]],
        "seq_outs": [["O", "O", "O", "O"], ["O", "O", "O", "O"]],
        "labels": [["statement"], ["query"]]
      }
    },
    ...
  ],
  ...
}
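To make the structure above concrete, here is a short sketch (assuming the file on disk is plain JSON, without the // comments and ... shown for exposition) that loads one episode and pairs each support utterance with its tags and classification labels:

```python
import json

data = json.loads("""
{
  "domain_name": [
    {
      "support": {
        "seq_ins": [["we", "are", "friends", "."], ["how", "are", "you", "?"]],
        "seq_outs": [["O", "O", "O", "O"], ["O", "O", "O", "O"]],
        "labels": [["statement"], ["query"]]
      },
      "query": {
        "seq_ins": [["we", "are", "friends", "."], ["how", "are", "you", "?"]],
        "seq_outs": [["O", "O", "O", "O"], ["O", "O", "O", "O"]],
        "labels": [["statement"], ["query"]]
      }
    }
  ]
}
""")

episode = data["domain_name"][0]
for seq_in, seq_out, labels in zip(episode["support"]["seq_ins"],
                                   episode["support"]["seq_outs"],
                                   episode["support"]["labels"]):
    # Every token carries one slot tag; each utterance carries sentence-level labels.
    assert len(seq_in) == len(seq_out)
    print(" ".join(seq_in), "->", labels[0])
```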

Acknowledgment

The platform is developed by HIT-SCIR. If you have any questions or suggestions, please contact us (Yutai Hou - [email protected] or Yongkui Lai - [email protected]).

License for code and data

Apache License 2.0
