
Pre-Training to Learn in Context

This repository contains the code of our ACL2023 paper:

Pre-Training to Learn in Context

In this work, we propose PICL (Pre-training for In-Context Learning), a framework that enhances language models' in-context learning ability by pre-training them on a large collection of "intrinsic tasks" mined from a general plain-text corpus, using the simple language modeling objective. PICL encourages the model to infer and perform tasks by conditioning on the contexts while maintaining the task generalization of pre-trained models.

[Figure: Overview of the PICL framework]
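
Concretely, a PICL pre-training instance is a concatenation of paragraphs predicted to share the same intrinsic task, followed by a query paragraph, trained with the ordinary next-token objective. The sketch below illustrates this instance format; the function name, the separator, the max length, and the choice to place labels on all tokens are illustrative assumptions, not the repository's actual API:

# Illustrative sketch of the PICL instance format, not the repository's actual code.
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

def build_picl_instance(demo_paragraphs, query_paragraph, max_length=1024):
    # Same-intrinsic-task paragraphs are concatenated before the query and
    # trained as one plain language-modeling sequence.
    text = "\n".join(demo_paragraphs + [query_paragraph])
    input_ids = tokenizer(text, truncation=True, max_length=max_length)["input_ids"]
    # Causal LM objective: here the labels simply mirror the inputs.
    return {"input_ids": input_ids, "labels": list(input_ids)}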

1 Install

# setup environment with conda
conda create -n picl python=3.8
# install basic packages
pip3 install -r requirements.txt
conda install faiss-gpu -c pytorch
# install transformers & promptsource
pip3 install -e transformers
pip3 install -e promptsource

2 Prepare Plain-Text Corpus

Download OpenWebText, Wikicorpus, and Bookcorpus. Run tools/prepare_raw_data.py to extract the full documents from each corpus and merge them:

python3 tools/prepare_raw_data.py /PATH/TO/openwebtext pretrain_data/raw/openwebtext.txt
python3 tools/prepare_raw_data.py /PATH/TO/wikicorpus pretrain_data/raw/wikicorpus.txt
python3 tools/prepare_raw_data.py /PATH/TO/bookcorpus pretrain_data/raw/bookcorpus.txt
cat pretrain_data/raw/openwebtext.txt pretrain_data/raw/wikicorpus.txt pretrain_data/raw/bookcorpus.txt > pretrain_data/raw/merge_no_shuf.txt
shuf -o pretrain_data/raw/merge.txt pretrain_data/raw/merge_no_shuf.txt

The "\n" tokens in full documents are replace by a special token "<@x(x!>" such that each document occupy a single line in the file.

3 Run the Pipeline

Run the entire pipeline in a toy setting (corpus size = 100K) with

bash pipeline.sh

${BASE_PATH} is the path to the root directory of this project.

The details of each step in the pipeline are shown in the following sections.

4 Construct PICL Data

We release the constructed PICL data at this link.

You can inspect the same-intrinsic-task paragraphs by running python3 check_picl_data.py and then entering an integer index to pick a query and view its retrieved paragraphs:

LaTeX Equation Translation
Input Paragraph Index >>> 11156
##########  Query  ##########
ω p = I s ω s I p cos ⁡ ( α ) {\displaystyle {\boldsymbol {\omega }}_{\mathrm {p} }={\frac {{\boldsymbol {I}}_{\mathrm {s} }{\boldsymbol {\omega }}_{\mathrm {s} }}{{\boldsymbol {I}}_{\mathrm {p} }\cos({\boldsymbol {\alpha }})}}}

##########  Retrieved Paragraph #1  ##########
τ b ∗ = τ b ( ρ s − ρ f ) ( g ) ( D ) {\displaystyle \tau _{b}*={\frac {\tau _{b}}{(\rho _{s}-\rho _{f})(g)(D)}}}


##########  Retrieved Paragraph #2  ##########
M H ≤ ℏ c 3 8 π G k B T u {\displaystyle M_{\mathrm {H} }\leq {\frac {\hbar c^{3}}{8\pi Gk_{\mathrm {B} }T_{\mathrm {u} }}}}

...
Question Answering
##########  Query  ##########
Question: Where would a gnarly off-road racer like Tanner Foust meet up with a frightened five-year-old child with leukemia? Answer: In a hospital, of course!


##########  Retrieved Paragraph #1  ##########
Question: What do a siren, an in-wall light switch, a sleep sensing iPhone dock, and a flood detector have in common? Answer: They are all SmartThings!


##########  Retrieved Paragraph #2  ##########
Question: Where do you find a one legged dog? Answer: Where you left it.
...

Here are some indices for interesting paragraphs. Try it out!

Indices: 0, 8, 109, 1000, 4645, 5384, 9473, 11156, 11969, 12231, 17838, 17849, 28844, 28845, 37577, 40119, 59996, 85034, 90096, 97616

You can also construct the PICL data from scratch by following the instructions below.

4.1 Preprocessing and Tokenization

Tokenize and store full documents and paragraphs into binary files.

  • Split full documents into paragraphs.
    bash scripts/tools/process_corpus.sh
  • Process the full-document data. The script generates .bin and .idx files (see the sketch after this list).
    bash scripts/tools/process_full_doc_data_gpt2.sh ${BASE_PATH}
  • Tokenize the paragraphs in the corpus. The script also generates .bin and .idx files.
    bash scripts/tools/process_picl_data_gpt2.sh ${BASE_PATH}
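
A (.bin, .idx) pair stores token ids in a flat binary file plus document offsets in an index file. The repository's real format is defined by its own data utilities; the minimal sketch below only illustrates the idea, and the layout, dtype, and function names are assumptions:

# Minimal sketch of a (.bin, .idx) pair: token ids in .bin, document
# boundaries in .idx. Illustrative only; the repo's actual format differs.
import numpy as np
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

def write_indexed(docs, prefix):
    offsets = [0]
    with open(prefix + ".bin", "wb") as f:
        for doc in docs:
            # GPT-2's vocabulary (50257 tokens) fits in uint16.
            ids = np.array(tokenizer(doc)["input_ids"], dtype=np.uint16)
            ids.tofile(f)
            offsets.append(offsets[-1] + len(ids))
    np.array(offsets, dtype=np.int64).tofile(prefix + ".idx")

def read_doc(prefix, i):
    offsets = np.fromfile(prefix + ".idx", dtype=np.int64)
    data = np.memmap(prefix + ".bin", dtype=np.uint16, mode="r")
    return data[offsets[i]:offsets[i + 1]]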

NOTE

Since the corpus is large, the .bin file for the full-document data will be about 29G and the one for the paragraph data about 13G. Data processing may take a long time, and unexpected problems (such as running out of CPU memory) may stall the process. To mitigate this, you can split the merge.txt file into multiple files:

split -C 1000M merge.txt

Then process the split files one by one (setting a different picl-data-name and bin-file-index in process_full_doc_data_gpt2.sh for each run); each run takes less time and carries less risk of failure. Assume you have generated two (.bin, .idx) pairs:

train_lm_1.bin
train_lm_1.idx
train_lm_2.bin
train_lm_2.idx

You can finally merge them by running

bash scripts/tools/merge_bin_files.sh ${BASE_PATH}

which will merge the two pairs into train_lm_0.bin and train_lm_0.idx.

4.2 Retrieval

  • Process the training data for the retriever and construct hard negatives. The raw data and the preprocessed data can be downloaded from this link. To run the processing yourself, put the raw datasets under data/ and run the following command:
    python3 tools/process_retriever_train_data.py --save retriever_data --data-names TRAIN
  • Train the retriever. The train.jsonl and valid.jsonl files should be placed in retriever_data/TRAIN/p1_en1_hn4_s42/merge. The trained retriever can be downloaded from this link.
    bash scripts/retriever/train.sh ${BASE_PATH}
  • Get encoded paragraphs.
    bash scripts/retriever/infer.sh ${BASE_PATH}
  • Search for paragraphs that share the same intrinsic tasks.
    bash scripts/retriever/search.sh ${BASE_PATH}
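
Under the hood, the search step is a dense nearest-neighbor lookup over the encoded paragraphs; faiss-gpu, installed in step 1, does the heavy lifting. A minimal sketch follows, in which the .npy filenames and the embedding width are hypothetical placeholders for the outputs of infer.sh:

# Sketch of the dense retrieval step with FAISS (illustrative only).
import numpy as np
import faiss

d = 768                                      # retriever embedding width (assumed)
para_emb = np.load("para_embeddings.npy")    # (N, d) float32 paragraph encodings
query_emb = np.load("query_embeddings.npy")  # (M, d) float32 query encodings

# Normalize so that inner product equals cosine similarity.
faiss.normalize_L2(para_emb)
faiss.normalize_L2(query_emb)

index = faiss.IndexFlatIP(d)
index.add(para_emb)
scores, neighbors = index.search(query_emb, 20)  # top-20 same-task candidates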

4.3 Filter

  • Filter out non-informative samples.
    bash scripts/filter/filter.sh ${BASE_PATH}
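
The exact criterion is implemented by the script above. One natural reading of "non-informative", which the sketch below adopts purely as an assumption, is that the retrieved demonstrations should lower the language model's loss on the query paragraph, and instances where they do not are dropped:

# Hypothetical informativeness filter: keep an instance only if conditioning
# on the demonstrations reduces the LM loss on the query paragraph.
# Illustrative only; the repo's actual criterion lives in scripts/filter/.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

@torch.no_grad()
def query_loss(context, query):
    ids = tok(context + query, return_tensors="pt")["input_ids"]
    n_ctx = len(tok(context)["input_ids"])
    labels = ids.clone()
    labels[:, :n_ctx] = -100  # score only the query tokens
    return model(ids, labels=labels).loss.item()

def is_informative(demos, query):
    context = "\n".join(demos) + "\n"
    return query_loss(context, query) < query_loss("", query)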

5 Pre-train

  • Pre-train the LM with PICL. The pre-trained models can be downloaded from this link.
    bash scripts/pretrain/pretrain_picl_gpt2_large.sh ${BASE_PATH}

6 Evaluation

  • Evaluate the trained model on text classification datasets and Super-Natural Instructions. The evaluation data can be downloaded from this link.
    bash scripts/eval/eval_cls.sh ${BASE_PATH}
    bash scripts/eval/eval_inst.sh ${BASE_PATH}
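
For the classification datasets, in-context evaluation amounts to concatenating k demonstrations with the test input and picking the label whose verbalizer the model scores highest. A minimal sketch, in which the prompt template and verbalizers are illustrative rather than the repository's:

# Minimal sketch of k-shot in-context classification scoring.
# Prompt format and verbalizers are illustrative assumptions.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

@torch.no_grad()
def label_score(prompt, verbalizer):
    ids = tok(prompt + verbalizer, return_tensors="pt")["input_ids"]
    n_prompt = len(tok(prompt)["input_ids"])
    labels = ids.clone()
    labels[:, :n_prompt] = -100  # score only the verbalizer tokens
    return -model(ids, labels=labels).loss.item()

demos = ("Review: great movie! Sentiment: positive\n"
         "Review: a dull mess. Sentiment: negative\n")
test = "Review: I loved every minute. Sentiment:"
pred = max([" positive", " negative"], key=lambda v: label_score(demos + test, v))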

7 Citation

@inproceedings{gu2023picl,
  title={Pre-Training to Learn in Context},
  author={Gu, Yuxian and Dong, Li and Wei, Furu and Huang, Minlie},
  booktitle={Proceedings of ACL},
  year={2023}
}
