
Text-AutoAugment (TAA)

This repository contains the code for our paper Text AutoAugment: Learning Compositional Augmentation Policy for Text Classification (EMNLP 2021 main conference).

[Figure: overview of the TAA framework]

Updates

  • [22.02.23]: We added an example of how to use TAA for your custom (local) dataset.
  • [21.10.27]: We made taa installable as a package and adapted it to huggingface/transformers. Now you can search an augmentation policy for a huggingface dataset with TWO lines of code.

Overview

  1. We present a learnable and compositional framework for data augmentation. Our proposed algorithm automatically searches for the optimal compositional policy, which improves the diversity and quality of augmented samples (a toy sketch of the policy structure follows this list).

  2. In low-resource and class-imbalanced regimes of six benchmark datasets, TAA significantly improves the generalization ability of deep neural networks like BERT and effectively boosts text classification performance.
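To make "compositional policy" concrete: a policy is a set of sub-policies, and each sub-policy chains a few operations, each with a sampling probability and a magnitude. The toy sketch below is our own illustration only; the operation names and their no-op implementations are placeholders, not the repo's actual operations.

import random

# Placeholder operations: a real operation would rewrite roughly a
# `mag` fraction of the tokens (e.g. synonym replacement); these are no-ops.
ops = {
    "synonym_replace": lambda text, mag: text,
    "random_swap": lambda text, mag: text,
    "random_delete": lambda text, mag: text,
}

# A policy is a set of sub-policies; each sub-policy chains `num_op`
# operations, each with a probability and a magnitude.
policy = [
    [("synonym_replace", 0.7, 0.3), ("random_swap", 0.5, 0.2)],    # sub-policy 1
    [("random_delete", 0.4, 0.1), ("synonym_replace", 0.6, 0.4)],  # sub-policy 2
]

def augment(text, policy, n_aug=2):
    # Each augmented sample applies one randomly chosen sub-policy;
    # every operation in the chain fires with its own probability.
    samples = []
    for _ in range(n_aug):
        augmented = text
        for name, prob, mag in random.choice(policy):
            if random.random() < prob:
                augmented = ops[name](augmented, mag)
        samples.append(augmented)
    return samples

print(augment("the movie was great", policy, n_aug=4))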

Getting Started

Prepare environment

Install PyTorch and a few additional dependencies, then install this repo as a Python package. Note that the CUDA version of the torch wheel below (cu102) should match the CUDA version on your machine.

# Clone this repo
git clone https://github.com/lancopku/text-autoaugment.git
cd text-autoaugment

# Create a conda environment
conda create -n taa python=3.6
conda activate taa

# Install dependencies
pip install torch==1.10.1+cu102 -f https://download.pytorch.org/whl/cu102/torch_stable.html
pip install git+https://github.com/wbaek/theconf
pip install git+https://github.com/ildoonet/pystopwatch2.git
pip install -r requirements.txt

# Install this library in development mode (no need to re-build if the source code is modified)
python setup.py develop

# Download the models in NLTK
python -c "import nltk; nltk.download('wordnet'); nltk.download('averaged_perceptron_tagger'); nltk.download('omw-1.4')"

Please make sure your Torch installation supports the GPU; check it with the command python -c "import torch; print(torch.cuda.is_available())" (it should output True).

Use TAA with Huggingface

1. Get augmented training dataset with TAA policy

Option 1: Search for the optimal policy

You can search for the optimal policy on classification datasets supported by huggingface/datasets:

from taa.search_and_augment import search_and_augment

# return the augmented train dataset in the form of torch.utils.data.Dataset
augmented_train_dataset = search_and_augment(configfile="/path/to/your/config.yaml")

The configfile (a YAML file) contains all the arguments, including paths, the model, the dataset, and optimization hyper-parameters. To run the code successfully, please preset these arguments carefully (a minimal generated sketch follows the list below):

  • model:
    • type: Backbone model
  • dataset:
    • path: Path or name of the dataset
    • name: Name of the dataset configuration
    • data_dir: data_dir of the dataset configuration
    • data_files: Path(s) to the source data file(s)

    ATTENTION: All the arguments above are passed to the load_dataset() function of huggingface/datasets. Please refer to its documentation for details.

    • text_key: Used to get the text from a data instance (a dict in huggingface/datasets; see the IMDB example)
  • abspath: Your working directory
  • aug: Pre-searched policy. We currently support IMDB, SST5, TREC, YELP2 and YELP5; see archive.py
  • per_device_train_batch_size: Batch size per device for training
  • per_device_eval_batch_size: Batch size per device for evaluation
  • epoch: Number of training epochs
  • lr: Learning rate
  • max_seq_length: Maximum sequence length
  • n_aug: Augment each text sample n_aug times
  • num_op: Number of operations per sub-policy
  • num_policy: Number of sub-policies per policy
  • method: Search method (taa)
  • topN: Ensemble the topN sub-policies to get the final policy
  • ir: Imbalance rate
  • seed: Random seed
  • trail: Trial index under the current random seed
  • train:
    • npc: Number of examples per class in the training set
  • valid:
    • npc: Number of examples per class in the validation set
  • test:
    • npc: Number of examples per class in the test set
  • num_search: Number of optimization iterations
  • num_gpus: Number of GPUs used by Ray
  • num_cpus: Number of CPUs used by Ray
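As a rough illustration (ours, not an official template; all values below are placeholders, so consult bert_sst2_example.yaml for the authoritative layout), a minimal configfile could be generated like this:

import yaml  # pip install pyyaml

# A minimal configfile sketch. All values are placeholders; see
# bert_sst2_example.yaml in the repo for the authoritative layout.
config = {
    "model": {"type": "bert-base-uncased"},
    "dataset": {"path": "glue", "name": "sst2", "text_key": "sentence"},
    "abspath": "/path/to/working/dir",
    "per_device_train_batch_size": 32,
    "per_device_eval_batch_size": 64,
    "epoch": 10,
    "lr": 4.0e-05,
    "max_seq_length": 128,
    "n_aug": 4,
    "num_op": 2,
    "num_policy": 4,
    "method": "taa",
    "topN": 3,
    "ir": 1.0,
    "seed": 59,
    "trail": 1,
    "train": {"npc": 50},
    "valid": {"npc": 50},
    "test": {"npc": 1000},
    "num_search": 200,
    "num_gpus": 4,
    "num_cpus": 40,
}

with open("my_config.yaml", "w") as f:
    yaml.safe_dump(config, f)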

configfile example 1: TAA for huggingface dataset

bert_sst2_example.yaml is an example configfile for the BERT model and the SST2 dataset. You can follow this example to create your own configfile for other huggingface datasets.

For instance, if you only want to change the dataset from sst2 to imdb, just delete sst2 in the 'path' argument, modify 'name' to imdb, and modify 'text_key' to text. The result should look like bert_imdb_example.yaml.

configfile example 2: TAA for custom (local) dataset

bert_custom_data_example.yaml is an example configfile for the BERT model and a custom (local) dataset. The custom dataset should be in CSV format, with columns named text and label. custom_data.csv is an example of such a dataset; a toy sketch for creating one follows.
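As a toy illustration (the file name and rows below are made up), such a CSV could be created with:

import csv

# Write a toy dataset in the expected format: a `text` column and a `label` column.
rows = [
    {"text": "a gripping and well-acted thriller", "label": 1},
    {"text": "dull, predictable, and far too long", "label": 0},
]

with open("custom_data.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["text", "label"])
    writer.writeheader()
    writer.writerows(rows)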

WARNING: The policy optimization framework is based on ray. By default we use 4 GPUs and 40 CPUs for policy optimization; make sure your computing resources meet this requirement, or create a new configuration file. Please also specify the GPUs, e.g., CUDA_VISIBLE_DEVICES=0,1,2,3, before running the code above. TPUs do not appear to be supported at the moment.

Option 2: Use our pre-searched policy

To train a model on a dataset augmented by our pre-searched policy, use the following (taking IMDB as an example):

from taa.search_and_augment import augment_with_presearched_policy

# return the augmented train dataset in the form of torch.utils.data.Dataset
augmented_train_dataset = augment_with_presearched_policy(configfile="/path/to/your/config.yaml")

Now we support IMDB, SST5, TREC, YELP2 and YELP5. See archive.py for details.

This table lists the test accuracy (%) of pre-searched TAA policy on full datasets:

Dataset   IMDB    SST-5   TREC    YELP-2   YELP-5
No Aug    88.77   52.29   96.40   95.85    65.55
TAA       89.37   52.55   97.07   96.04    65.73
n_aug     4       4       4       2        2

More pre-searched policies and their performance are coming soon.

2. Fine-tune a new model on the augmented training dataset

After getting augmented_train_dataset, you can feed it directly to the huggingface trainer. Please refer to search_augment_train.py for details; a minimal sketch follows.
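As a minimal sketch (our illustration, not the repo's exact script; the checkpoint name and hyper-parameters are placeholders, and we assume augmented_train_dataset already yields model-ready features, as in search_augment_train.py):

from transformers import (AutoModelForSequenceClassification, Trainer,
                          TrainingArguments)

# Fine-tune a classifier on the augmented dataset. The checkpoint and
# hyper-parameters below are placeholders; see search_augment_train.py
# for the repo's actual training setup.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

training_args = TrainingArguments(
    output_dir="./taa_finetune",
    num_train_epochs=3,
    per_device_train_batch_size=32,
    learning_rate=4e-5,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=augmented_train_dataset,  # from search_and_augment above
)
trainer.train()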

Reproduce results in the paper

Please see examples/reproduce_experiment.py, and run script/huggingface_lowresource.sh or script/huggingface_imbalanced.sh.

Contact

If you have any questions related to the code or the paper, feel free to open an issue.

Acknowledgments

This code builds on fast-autoaugment.

Citation

If you find this code useful for your research, please consider citing:

@inproceedings{ren2021taa,
    title = "Text {A}uto{A}ugment: Learning Compositional Augmentation Policy for Text Classification",
    author = "Ren, Shuhuai and Zhang, Jinchao and Li, Lei and Sun, Xu and Zhou, Jie",
    booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
    year = "2021",
}

License

MIT
