  • Stars: 181
  • Rank: 210,846 (Top 5%)
  • Language: Python
  • License: MIT License
  • Created: almost 2 years ago
  • Updated: about 1 year ago

Repository Details

[ICLR 2023] Codebase for Copy-Generator model, including an implementation of kNN-LM

Source code for Copy is All You Need

This repository contains the code and resources of our paper:

Copy is All You Need. ICLR 2023

Tian Lan, Deng Cai, Yan Wang, Heyan Huang, Xian-Ling Mao

Catalogue:

  1. Introduction
  2. Prepare the Dataset
  3. Train the Models
  4. Test the Models

1. Introduction: [Back to Top]

The dominant text generation models compose output by selecting words from a fixed vocabulary. In this paper, we formulate text generation as progressively copying text segments (e.g., words or phrases) from an existing text collection. We compute the contextualized representations of meaningful text segments and index them using efficient vector search toolkits. The task of text generation is then decomposed into a series of copy-and-paste operations: at each time step, we seek suitable text spans from existing articles in the text collection rather than selecting from a standalone vocabulary. Experiments on the standard language modeling benchmark (WikiText-103) show that our approach achieves better generation quality by copying from the original training data (0.758 vs. 0.691 MAUVE). We also show that our approach attains additional performance gains by simply scaling up to larger text collections without extra training. Furthermore, our approach allows for effective domain adaptation by simply switching to any domain-specific text collection, again without further training. Finally, we observe that our approach achieves better inference efficiency than standard token-level autoregressive models thanks to the reduction of decoding steps.
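
The copy-and-paste formulation can be illustrated with a small sketch: at each step the current prefix is encoded into a query vector, the query is matched against an index of phrase embeddings, and the best-scoring phrase is appended to the output. The toy encoder, phrase table, and random embeddings below are stand-ins for the trained phrase encoder and the FAISS index, not the actual implementation:

    # Minimal illustration of copy-and-paste decoding (toy stand-ins, not the paper's model).
    import numpy as np

    def encode_prefix(prefix: str, dim: int = 8) -> np.ndarray:
        # Stand-in prefix encoder: maps the prefix string to a unit vector.
        rng = np.random.default_rng(abs(hash(prefix)) % (2 ** 32))
        vec = rng.normal(size=dim)
        return vec / np.linalg.norm(vec)

    def copy_generate(prefix: str, phrases: list, phrase_vecs: np.ndarray, steps: int = 3) -> str:
        out = prefix
        for _ in range(steps):
            query = encode_prefix(out)            # query vector for the current prefix
            scores = phrase_vecs @ query          # dot-product search over the phrase index
            best = int(np.argmax(scores))         # copy the highest-scoring phrase
            out = out + " " + phrases[best]
        return out

    # Toy "phrase table" with random embeddings standing in for a trained index.
    phrases = ["machine learning", "is widely used", "in natural language processing", "text generation"]
    rng = np.random.default_rng(0)
    phrase_vecs = rng.normal(size=(len(phrases), 8))
    phrase_vecs /= np.linalg.norm(phrase_vecs, axis=1, keepdims=True)
    print(copy_generate("Deep", phrases, phrase_vecs))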

[Figure: overview of the copy-and-paste text generation framework]

Three benchmarks are used in this paper, and their preprocessing procedures are listed under the data folder (wikitext103, en_wiki, lawmt).


2. Prepare the Dataset: [Back to Top]

The corpora for Wikitext-103, Law-MT, and En-Wiki can be downloaded from this link (with the code ufhn). For the wikitext-103, law-mt, and en-wiki datasets, move the corresponding base_data_128.txt and test.txt files into data/{dataset_name}_1024, and run the commands in data/README.md to process these datasets.
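
A small helper like the one below can arrange the downloaded files into that layout. The downloads/<name> source directory is an assumption about where you unpacked the archive; match the {dataset_name} folder names to the ones used under data/ (wikitext103, en_wiki, lawmt):

    # Hypothetical helper for arranging the downloaded corpora into the expected layout.
    import shutil
    from pathlib import Path

    DATASETS = ["wikitext103", "en_wiki", "lawmt"]   # match the folder names under data/

    for name in DATASETS:
        src = Path("downloads") / name               # assumption: where the download was unpacked
        dst = Path("data") / f"{name}_1024"          # layout expected by data/README.md
        dst.mkdir(parents=True, exist_ok=True)
        for fname in ("base_data_128.txt", "test.txt"):
            shutil.copy(src / fname, dst / fname)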


3. Train the Models: [Back to Top]

1. Prepare the environment:

    pip install -r requirments.txt

2. Enter the folder and initialize the workspace:

    cd copyisallyouneed;
    python prepare_work_space.py

Running the prepare_work_space.py script will initialize folders under the root_dir (defined in config/base.yaml):

  • log: backups of previous checkpoints
  • ckpt: checkpoints of the trained models
  • rest: TensorBoard log files written during training

Before running, make sure the root_dir variable is updated to match your local environment.
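
For reference, prepare_work_space.py does roughly the following (a sketch, assuming root_dir is a top-level key in config/base.yaml; check the actual config before relying on it):

    # Sketch of the workspace initialization: create log/ckpt/rest under root_dir.
    import os
    import yaml

    with open("config/base.yaml") as f:
        config = yaml.safe_load(f)

    root_dir = config["root_dir"]                    # assumption: top-level key in base.yaml
    for sub in ("log", "ckpt", "rest"):
        os.makedirs(os.path.join(root_dir, sub), exist_ok=True)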

3. Run the baselines

The following examples run on the wikinews benchmark; replace it with wikitext or story to test the other benchmarks. Note that the training arguments and details are listed under config/*.yaml.

  1. train the gpt2 baseline

    # distributed training of the gpt2 model on the wikitext103 dataset
    ./scripts/train.sh wikitext103 gpt2 0,1,2,3,4,5,6,7
  2. train the retro baseline

    Follow the instructions in baseline/retro/README.md to train the retro baseline.

  3. train the KNN-LM baseline

    Note that the KNN-LM baseline is built upon the GPT2 baseline. Here, we run inference over the whole dataset to build the FAISS index for KNN-LM (a minimal sketch of this step is shown after this list).

    ./scripts/knnlm_inference.sh 0;
    
    # build the FAISS index; more details, such as the faiss index type, can be found in `build_index.py`
    python build_index.py
  4. train the copyisallyouneed model

    ./scripts/train.sh wikitext103 copyisallyouneed 0,1,2,3,4,5,6,7
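
As a reference for step 3 above, building the FAISS datastore for the KNN-LM baseline looks roughly like the sketch below. The dump file names and the flat inner-product index type are assumptions; the index type actually used is configured in build_index.py:

    # Hedged sketch of the kNN-LM FAISS step: index the key vectors dumped by the
    # inference pass, then query with a test-time hidden state.
    import faiss
    import numpy as np

    keys = np.load("knnlm_keys.npy").astype("float32")   # (num_tokens, hidden_dim); assumed dump file
    index = faiss.IndexFlatIP(keys.shape[1])              # exact inner-product search
    index.add(keys)
    faiss.write_index(index, "knnlm.faiss")

    # Retrieve the k nearest datastore entries for a query hidden state.
    query = np.random.rand(1, keys.shape[1]).astype("float32")
    scores, ids = index.search(query, 8)
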
4. Test the Models: [Back to Top]

After training, run the following commands to generate the result files for automatic and human evaluation. More details about inference can be found in the corresponding bash scripts.

  1. generate the results for gpt2 baseline

    ./scripts/gpt2_test.sh
  2. generate the results for retro baseline

    Follow the instructions in baseline/retro/README.md

  3. generate the results for KNN-LM baseline

    ./scripts/knnlm_test.sh
  4. generate the results for the Copyisallyouneed model

    ./scripts/copyisallyouneed_test.sh

Running the above scripts will generate the corresponding result files under the copyisallyouneed folder with self-explanatory names. For automatic evaluation, move these files to the evaluation/* folder to compute MAUVE, Diversity/Rep-n, and Coherence. More details about the automatic evaluation procedure can be found in evaluation/README.md.
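
As an example of the MAUVE part of that evaluation, the public mauve-text package (pip install mauve-text) can be used as sketched below; the result-file names and the one-text-per-line format are assumptions, and the exact procedure is described in evaluation/README.md:

    # Hedged MAUVE example: compare generated continuations against human references.
    import mauve

    with open("evaluation/human_reference.txt") as f:          # assumed file of human texts
        p_text = [line.strip() for line in f if line.strip()]
    with open("evaluation/copyisallyouneed_result.txt") as f:  # assumed generated results file
        q_text = [line.strip() for line in f if line.strip()]

    out = mauve.compute_mauve(p_text=p_text, q_text=q_text, device_id=0, max_text_length=256)
    print("MAUVE:", out.mauve)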

For the human evaluation, move these files to the make_human_evaluation/raw_files/ folder, and run this command:

./run.sh

More details about the human evaluation can be found in make_human_evaluation/README.md.

Contact

If you have any questions, feel free to contact me via (lantiangmftby at gmail.com).

Citation

@inproceedings{
    lan2023copy,
    title={Copy is All You Need},
    author={Tian Lan and Deng Cai and Yan Wang and Heyan Huang and Xian-Ling Mao},
    booktitle={The Eleventh International Conference on Learning Representations},
    year={2023},
    url={https://openreview.net/forum?id=CROlOA9Nd8C}
}

More Repositories

  1. MultiTurnDialogZoo: Multi-turn dialogue baselines written in PyTorch (Python, 162 stars)
  2. science-llm: A large-scale language model for the scientific domain, trained on the redpajama arXiv split (Python, 120 stars)
  3. OpenDialog: An open-source package for a Chinese open-domain conversational chatbot (Chinese chitchat dialogue system; one-click deployment of a WeChat chitchat bot) (Python, 108 stars)
  4. SimpleReDial-v1: The source code of the DR-BERT model and baselines (Python, 38 stars)
  5. RUBER-and-Bert-RUBER: Implementation of RUBER: An Unsupervised Method for Automatic Evaluation of Open-Domain Dialog Systems (Python, 29 stars)
  6. Rep-Dropout: [NeurIPS 2023] Repetition In Repetition Out: Towards Understanding Neural Text Degeneration from the Data Perspective (Python, 27 stars)
  7. MomentumDecoding: Momentum Decoding: Open-ended Text Generation as Graph Exploration (Python, 19 stars)
  8. EDA-NLP-Chinese: Easy Data Augmentation for NLP on Chinese (Python, 16 stars)
  9. PONE (Jupyter Notebook, 13 stars)
  10. GPT2Dialog: English or Chinese GPT2Dialog model from GPT2-chitchat (Python, 11 stars)
  11. Study: Good good study, day day ugly (Jupyter Notebook, 10 stars)
  12. Primary_Explainable_Factual_Consistency_Evaluation_Model: A simple demo of an explainable factual consistency evaluation model, optimizing InternLM-7B with QLoRA (Python, 10 stars)
  13. WhenToSpeak: The code of our paper "When to Talk: Chatbot Controls the Timing of Talking during Multi-turn Open-domain Dialogue Generation" (Jupyter Notebook, 9 stars)
  14. BIT-PSO: PSO algorithm for solving the JSP problem (Python, 6 stars)
  15. Transformer-Dialog: PyTorch Transformer dialogue model (Python, 6 stars)
  16. EasyNLP (Python, 5 stars)
  17. EvidenceRetrievalLeaderboard: The leaderboard for the evidence retrieval task (5 stars)
  18. General-Zero: AlphaZero for WTN-EinStein Chess (Python, 5 stars)
  19. FeedbackPreference: The repo for our proposed Feedback Preference corpus (Python, 4 stars)
  20. SurveyFactory: All the surveys I made, to save ideas, help newcomers, and review for myself (4 stars)
  21. DeepLearning-Course: Deep learning course (RL, RNN, CNN, TextGen) (Jupyter Notebook, 3 stars)
  22. BITNLP: NLP project (Python, 2 stars)
  23. housechat: https://www.datafountain.cn/competitions/474 (Python, 2 stars)
  24. CCompilerInPython: A simple C compiler in Python (HTML, 2 stars)
  25. HashRetrieval: Learning to Hash for Coarse Retrieval in Open-Domain Dialog Systems (Python, 1 star)
  26. DataHammer-Training-Room: Machine learning practice for students getting in touch with the DataHammer group (Shell, 1 star)
  27. Paper-shredder: Paper, paper, and paper (Python, 1 star)
  28. ubuntu-v2 (Jupyter Notebook, 1 star)