
Repository Details

SGM: Sequence Generation Model for Multi-label Classification (COLING 2018)

This is the code for our paper "SGM: Sequence Generation Model for Multi-label Classification" [pdf].


Note

In general, this code is most suitable for the following application scenarios:

  • The dataset is relatively large:
    • The performance of the seq2seq model depends on the size of the dataset.
  • There exist orders or dependencies among the labels:
    • A reasonable prior order of labels tends to be helpful (see the frequency-ordering sketch after this list).
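As a concrete illustration of what a "reasonable prior order" can mean, labels are often sorted by descending training-set frequency before the target sequences are built. The sketch below shows that idea in Python; the function name and the sample labels are hypothetical, and the repository's own preprocessing may order labels differently.

from collections import Counter

def order_labels_by_frequency(label_sets):
    """Sort each sample's label list by descending corpus frequency.

    One way to impose a "reasonable prior order"; the repository's own
    preprocessing may order labels differently.
    """
    freq = Counter(label for labels in label_sets for label in labels)
    return [sorted(labels, key=lambda l: -freq[l]) for labels in label_sets]

# Hypothetical usage with made-up labels:
samples = [["economy", "markets"], ["markets", "sports", "economy"]]
print(order_labels_by_frequency(samples))
# [['economy', 'markets'], ['markets', 'economy', 'sports']]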

Requirements

  • Ubuntu 16.04
  • Python version >= 3.5
  • PyTorch version >= 1.0.0

Dataset

The RCV1-V2 dataset we used can be downloaded from Google Drive via this link. The structure of the folders on the drive is:

Google Drive Root                  # The compressed zip file
 |-- data                          # The unprocessed raw data files
 |    |-- train.src        
 |    |-- train.tgt
 |    |-- valid.src
 |    |-- valid.tgt
 |    |-- test.src
 |    |-- test.tgt
 |    |-- topic_sorted.json        # The json file of label set for evaluation
 |-- checkpoints                   # The pre-trained model checkpoints
 |    |-- sgm.pt
 |    |-- sgmge.pt

We found that the validation set in the previous release was so small that the model tended to overfit it, resulting in unstable performance. We have therefore expanded the validation set. In addition, we filtered out samples containing more than 500 words from the original RCV1-V2 dataset.
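For reference, the sketch below shows how such a length filter might be applied to the parallel .src/.tgt files. It assumes one whitespace-tokenized sample per line in both files, which is an assumption about the format rather than something documented here.

def filter_long_samples(src_path, tgt_path, max_words=500):
    """Drop parallel samples whose source text exceeds max_words tokens.

    Assumes one whitespace-tokenized sample per line in both .src/.tgt
    files; the actual file format used by the repository may differ.
    """
    kept = []
    with open(src_path, encoding="utf-8") as fs, open(tgt_path, encoding="utf-8") as ft:
        for src, tgt in zip(fs, ft):
            if len(src.split()) <= max_words:
                kept.append((src.rstrip("\n"), tgt.rstrip("\n")))
    return kept

# Hypothetical usage on the downloaded files:
# pairs = filter_long_samples("./data/train.src", "./data/train.tgt")
# print(len(pairs), "samples kept")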


Reproducibility

We provide the pretrained checkpoints of the SGM model and the SGM+GE model on the RCV1-V2 dataset to help you reproduce our reported experimental results. The detailed reproduction steps are as follows:

  • Please download the RCV1-V2 dataset and checkpoints first by clicking on the link, then put them in the same directory as this code. The correct structure of the folders should be:
Root
 |-- data                          
 |    |-- ...        
 |-- checkpoints                   
 |    |-- ...
 |-- models                   
 |    |-- ...
 |-- utils                   
 |    |-- ...
 |-- preprocess.py
 |-- train.py
 |-- ...
  • Preprocess the downloaded data:
python3 preprocess.py -load_data ./data/ -save_data ./data/save_data/ -src_vocab_size 50000

All the preprocessed data will be stored in the folder ./data/save_data/.

  • Perform prediction and evaluation:
python3 predict.py -gpus gpu_id -data ./data/save_data/ -batch_size 64 -restore ./checkpoints/sgm.pt -log results/

The predicted labels and evaluation scores will be stored in the folder results/.


Training from scratch

Preprocessing

You can preprocess the dataset with the following command:

# -load_data:       input file dir for the data
# -save_data:       output file dir for the processed data
# -src_vocab_size:  size of the source vocabulary
python3 preprocess.py \
	-load_data load_data_path \
	-save_data save_data_path \
	-src_vocab_size 50000

Note that all data paths must end with /. Descriptions of the other parameters can be found in preprocess.py.
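In essence, -src_vocab_size keeps only the most frequent source tokens and maps everything else to an unknown symbol. The sketch below illustrates that cutoff; the special tokens and counting rules are assumptions for illustration, not the actual logic of preprocess.py.

from collections import Counter

def build_vocab(src_path, vocab_size=50000,
                specials=("<pad>", "<unk>", "<s>", "</s>")):
    """Keep the vocab_size most frequent source tokens plus specials.

    The special tokens here are assumptions; preprocess.py may use a
    different inventory and different counting rules.
    """
    counts = Counter()
    with open(src_path, encoding="utf-8") as f:
        for line in f:
            counts.update(line.split())
    itos = list(specials) + [tok for tok, _ in counts.most_common(vocab_size)]
    stoi = {tok: i for i, tok in enumerate(itos)}
    return stoi, itos

# Hypothetical usage:
# stoi, itos = build_vocab("./data/train.src", vocab_size=50000)
# print(len(itos), "tokens in the source vocabulary")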


Training

You can perform model training with the following command:

python3 train.py -gpus gpu_id -config model_config -log save_path

All log files and checkpoints produced during training will be saved in save_path. The detailed parameter descriptions can be found in train.py.
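For intuition about the SGM+GE checkpoint (sgmge.pt): in the paper, the global embedding at each decoding step mixes the embedding of the top-scoring previous label with an average of all label embeddings weighted by the previous softmax distribution, combined through a learned gate. The PyTorch sketch below is an illustrative rendering under assumed layer and tensor names, not the code in models/.

import torch
import torch.nn as nn

class GlobalEmbedding(nn.Module):
    """Illustrative sketch of the global embedding (GE) used by SGM+GE.

    At decoding step t, the embedding of the top-scoring previous label
    is fused with an average of all label embeddings weighted by the
    previous softmax distribution, through a learned sigmoid gate.
    Layer names and shapes are assumptions, not the code in models/.
    """

    def __init__(self, num_labels, emb_dim):
        super().__init__()
        self.embedding = nn.Embedding(num_labels, emb_dim)
        self.w1 = nn.Linear(emb_dim, emb_dim, bias=False)
        self.w2 = nn.Linear(emb_dim, emb_dim, bias=False)

    def forward(self, prev_probs):
        # prev_probs: (batch, num_labels) softmax over labels at step t-1
        greedy = self.embedding(prev_probs.argmax(dim=-1))         # e
        weighted = prev_probs @ self.embedding.weight              # e_bar
        gate = torch.sigmoid(self.w1(greedy) + self.w2(weighted))  # H
        return (1 - gate) * greedy + gate * weighted

# Hypothetical usage:
# ge = GlobalEmbedding(num_labels=103, emb_dim=512)
# emb = ge(torch.softmax(torch.randn(8, 103), dim=-1))  # shape (8, 512)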


Testing

You can perform testing with the following command:

python3 predict.py -gpus gpu_id -data save_data_path -batch_size batch_size -log log_path

To evaluate a trained model, also pass its checkpoint via -restore, as in the reproduction step above. The predicted labels and evaluation scores will be stored in the folder log_path. The detailed parameter descriptions can be found in predict.py.
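If you want to score predicted label sets outside predict.py, the hedged sketch below computes micro-F1 and hamming loss with scikit-learn (which is not a dependency of this repository). The helper name and sample labels are hypothetical, and topic_sorted.json is assumed to hold the evaluation label set as described in the Dataset section.

from sklearn.metrics import f1_score, hamming_loss
from sklearn.preprocessing import MultiLabelBinarizer

def score(pred_label_sets, gold_label_sets, label_space):
    """Micro-F1 and hamming loss over predicted vs. gold label sets.

    label_space would be the evaluation label set (e.g. the labels in
    topic_sorted.json); scikit-learn is used here only for illustration,
    since predict.py computes its own evaluation scores.
    """
    mlb = MultiLabelBinarizer(classes=label_space)
    y_pred = mlb.fit_transform(pred_label_sets)
    y_gold = mlb.transform(gold_label_sets)
    return {
        "micro_f1": f1_score(y_gold, y_pred, average="micro"),
        "hamming_loss": hamming_loss(y_gold, y_pred),
    }

# Hypothetical usage with made-up RCV1-style topic codes:
# print(score([["E11", "M13"]], [["E11"]], label_space=["E11", "M13", "C15"]))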


Citation

If you use the above code for your research, please cite the paper:

@inproceedings{YangCOLING2018,
  author    = {Pengcheng Yang and
               Xu Sun and
               Wei Li and
               Shuming Ma and
               Wei Wu and
               Houfeng Wang},
  title     = {{SGM:} Sequence Generation Model for Multi-label Classification},
  booktitle = {Proceedings of the 27th International Conference on Computational
               Linguistics, {COLING} 2018, Santa Fe, New Mexico, USA, August 20-26,
               2018},
  pages     = {3915--3926},
  year      = {2018}
}

More Repositories

1. pkuseg-python - The pkuseg toolkit for multi-domain Chinese word segmentation (Python, 6,430 stars)
2. Chinese-Literature-NER-RE-Dataset - A Discourse-Level Named Entity Recognition and Relation Extraction Dataset for Chinese Literature Text (399 stars)
3. Global-Encoding - Global Encoding for Abstractive Summarization (ACL 2018) (Python, 273 stars)
4. Graph-to-seq-comment-generation - Code for the paper "Coherent Comments Generation for Chinese Articles with a Graph-to-Sequence Model" (Python, 174 stars)
5. SU4MLC - Code for "Semantic-Unit-Based Dilated Convolution for Multi-Label Text Classification" (EMNLP 2018) (Python, 154 stars)
6. DPGAN - Diversity-Promoting Generative Adversarial Network for Generating Informative and Diversified Text (EMNLP 2018) (Python, 144 stars)
7. superAE - Code for "Autoencoder as Assistant Supervisor: Improving Text Representation for Chinese Social Media Text Summarization" (Python, 136 stars)
8. AdaMod - Adaptive and Momental Bounds for Adaptive Learning Rate Methods (Python, 125 stars)
9. livebot - LiveBot: Generating Live Video Comments Based on Visual and Textual Contexts (AAAI 2019) (Python, 124 stars)
10. text-autoaugment - Text AutoAugment: Learning Compositional Augmentation Policy for Text Classification (EMNLP 2021) (Python, 124 stars)
11. label-words-are-anchors - Repository for "Label Words are Anchors: An Information Flow Perspective for Understanding In-Context Learning" (Python, 115 stars)
12. meProp - meProp: Sparsified Back Propagation for Accelerated Deep Learning (ICML 2017) (C#, 110 stars)
13. Unpaired-Sentiment-Translation - Code for "Unpaired Sentiment-to-Sentiment Translation: A Cycled Reinforcement Learning Approach" (ACL 2018) (Python, 107 stars)
14. WEAN - Code for "Query and Output: Generating Words by Querying Distributed Word Representations for Paraphrase Generation" (NAACL 2018) (Python, 93 stars)
15. label-embedding-network - Label Embedding Network (Python, 90 stars)
16. Prime - A simple module that consistently outperforms self-attention and the Transformer model on main NMT datasets with SoTA performance (Python, 85 stars)
17. AAPR - Automatic Academic Paper Rating: Data and Model (ACL 2018) (Python, 72 stars)
18. Skeleton-Based-Generation-Model - Code for "A Skeleton-Based Model for Promoting Coherence Among Sentences in Narrative Story Generation" (EMNLP 2018) (Python, 64 stars)
19. Explicit-Sparse-Transformer - Code for Explicit Sparse Transformer (Python, 55 stars)
20. SMAE - Code for "Learning Sentiment Memories for Sentiment Modification without Parallel Data" (Python, 55 stars)
21. LancoSum - A toolkit for abstractive summarization that makes it easy to implement the baselines and our proposed models, which achieve SOTA performance (Python, 50 stars)
22. Seq2Set - Code for the paper "A Deep Reinforced Sequence-to-Set Model for Multi-Label Classification" (Python, 50 stars)
23. AMM - Code for "An Auto-Encoder Matching Model for Learning Utterance-Level Semantic Dependency in Dialogue Generation" (EMNLP 2018) (Python, 49 stars)
24. bag-of-words - Code for "Bag-of-Words as Target for Neural Machine Translation" (Python, 45 stars)
25. AdaNorm - Code for "Understanding and Improving Layer Normalization" (Python, 43 stars)
26. SRB - Code for "Improving Semantic Relevance for Sequence-to-Sequence Learning of Chinese Social Media Text Summarization" (Python, 41 stars)
27. DynamicKD - Code for the EMNLP 2021 main conference paper "Dynamic Knowledge Distillation for Pre-trained Language Models" (Python, 38 stars)
28. simNet - Code for "simNet: Stepwise Image-Topic Merging Network for Generating Detailed and Comprehensive Image Captions" (EMNLP 2018) (Python, 37 stars)
29. Embedding-Poisoning - Code for "Be Careful about Poisoned Word Embeddings: Exploring the Vulnerability of the Embedding Layers in NLP Models" (NAACL-HLT 2021) (Python, 34 stars)
30. well-classified-examples-are-underestimated - Code for the AAAI 2022 paper "Well-classified Examples are Underestimated in Classification with Deep Neural Networks" (Jupyter Notebook, 32 stars)
31. IAIS - Learning Relation Alignment for Calibrated Cross-modal Retrieval (ACL 2021) (Python, 30 stars)
32. Chinese-Dependency-Treebank-with-Ellipsis - An Ellipsis-aware Chinese Dependency Treebank for Web Text (Python, 26 stars)
33. DeconvDec - Code for "Deconvolution-Based Global Decoding for Neural Machine Translation" (COLING 2018) (Python, 26 stars)
34. HSSC - Code for "A Hierarchical End-to-End Model for Jointly Improving Text Summarization and Sentiment Classification" (IJCAI 2018) (Python, 23 stars)
35. tcm_prescription_generation - Code for "Exploration on Generating Traditional Chinese Medicine Prescriptions from Symptoms with an End-to-End Approach" (Python, 23 stars)
36. clip-openness - Delving into the Openness of CLIP (ACL 2023) (Python, 22 stars)
37. CGM - Code for the IJCAI 2021 main conference paper "Long-term, Short-term and Sudden Event: Trading Volume Movement Prediction with Graph-based Multi-view Modeling" (Python, 21 stars)
38. SOS - Code for "Rethinking Stealthiness of Backdoor Attack against NLP Models" (ACL-IJCNLP 2021) (Jupyter Notebook, 21 stars)
39. codable-watermarking-for-llm - Repository for "Towards Codable Watermarking for Large Language Models" (Python, 20 stars)
40. CMAC - The dataset and code for "Cross-Modal Commentator: Automatic Machine Commenting Based on Cross-Modal Information" (Python, 20 stars)
41. RAP - Code for "RAP: Robustness-Aware Perturbations for Defending against Backdoor Attacks on NLP Models" (EMNLP 2021) (Python, 19 stars)
42. MUKI - From Mimicking to Integrating: Knowledge Integration for Pre-Trained Language Models (Findings of EMNLP 2022) (Python, 19 stars)
43. ChineseNER - Code for "Cross-Domain and Semi-Supervised Named Entity Recognition in Chinese Social Media: A Unified Model" (Python, 18 stars)
44. meSimp - Code for "Training Simplification and Model Simplification for Deep Learning: A Minimal Effort Back Propagation Method" (C#, 18 stars)
45. LexicalAT - Code for "LexicalAT: Lexical-Based Adversarial Reinforcement Training for Robust Sentiment Classification" (Python, 17 stars)
46. Pivot - Code for "Key Fact as Pivot: A Two-Stage Model for Low Resource Table-to-Text Generation" (ACL 2019) (Python, 17 stars)
47. Avg-Avg - Holistic Sentence Embeddings for Better Out-of-Distribution Detection (Findings of EMNLP 2022) (Python, 16 stars)
48. RMSC - Data and code for "Review-Driven Multi-Label Music Style Classification by Exploiting Style Correlations" (Python, 14 stars)
49. Decode-CRF - Conditional Random Fields with Decode-based Learning (C#, 14 stars)
50. agent-backdoor-attacks - Code and data for "Watch Out for Your Agents! Investigating Backdoor Threats to LLM-Based Agents" (14 stars)
51. nndep - Transition-based dependency parser with neural networks and a hybrid oracle (C#, 14 stars)
52. SAPO - C# code for "Towards Easier and Faster Sequence Labeling for Natural Language Processing: A Search-based Probabilistic Online Learning Framework (SAPO)" (Information Sciences) (C#, 13 stars)
53. SACT - Code for "Automatic Temperature Control for Neural Machine Translation" (EMNLP 2018) (Python, 13 stars)
54. Augmented_Data_for_FST - The augmented data of "Parallel Data Augmentation for Formality Style Transfer" (ACL 2020) (12 stars)
55. ACA4NMT - Code of a novel model for NMT (Python, 11 stars)
56. CascadeBERT - Code for CascadeBERT (Findings of EMNLP 2021) (Python, 11 stars)
57. DCKD - Code and data for "Distributional Correlation-Aware Knowledge Distillation for Stock Trading Volume Prediction" (ECML-PKDD 2022) (Python, 10 stars)
58. Multi-Order-LSTM - Code for "Does Higher Order LSTM Have Better Accuracy for Segmenting and Labeling Sequence Data?" (Python, 9 stars)
59. SemPre - Towards Semantics-Enhanced Pre-Training: Can Lexicon Definitions Help Learning Sentence Meanings? (AAAI 2021) (Python, 9 stars)
60. DAN - Expose Backdoors on the Way: A Feature-Based Efficient Defense against Textual Backdoor Attacks (Findings of EMNLP 2022) (Python, 9 stars)
61. FedMNMT - Communication Efficient Federated Learning for Multilingual Machine Translation with Adapter (Findings of ACL 2023) (Python, 9 stars)
62. CVST - Code for "Knowledgeable Storyteller: A Commonsense-Driven Generative Model for Visual Storytelling" (7 stars)
63. Early-Exit - Code for "A Global Past-Future Early Exit Method for Accelerating Inference of Pre-trained Language Models" (Python, 7 stars)
64. Multi-Task-Learning - Online multi-task learning toolkit based on C#; code for "Large-Scale Personalized Human Activity Recognition using Online Multi-Task Learning" (TKDE) (C#, 6 stars)
65. NLP_Code_Index - Codes and papers from @lancopku (5 stars)
66. CRF-ADF - CRF toolkit based on C#; supports ADF (Adaptive stochastic gradient Descent based on Feature-frequency information, ACL 2012) (C#, 4 stars)
67. GKD (Python, 4 stars)
68. Sememe_prediction - Code for "Sememe Prediction: Learning Semantic Knowledge from Unstructured Textual Wiki Descriptions" (Python, 3 stars)
69. LPVDN - Python code for "Learning Robust Representation for Clustering through Locality Preserving Variational Discriminative Network" (Python, 3 stars)
70. Attention-Augmentation (Python, 2 stars)
71. GNOME - Code for the EACL 2023 paper "Fine-Tuning Deteriorates General Textual Out-of-Distribution Detection by Distorting Task-Agnostic Features" (1 star)
72. MR-VPC - Towards Multimodal Video Paragraph Captioning Models Robust to Missing Modality (1 star)