GLM
GLM is a General Language Model pretrained with an autoregressive blank-filling objective and can be finetuned on various natural language understanding and generation tasks.
Please refer to our paper for a detailed description of GLM:
GLM: General Language Model Pretraining with Autoregressive Blank Infilling (ACL 2022)
Zhengxiao Du*, Yujie Qian*, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, Jie Tang (*: equal contribution)
News: We have released ChatGLM-6B, an open pretrained language model with 6 billion parameters, built on the GLM framework and optimized for Chinese QA and dialogue.
Pretrained Models
You can download the pretrained models used in the paper from OneDrive or Tsinghua-Cloud.
Name | Params | Language | Corpus | Objective | File | Config |
---|---|---|---|---|---|---|
GLM-Base | 110M | English | Wiki+Book | Token | glm-base-blank.tar.bz2 | model_blocklm_base.sh |
GLM-Large | 335M | English | Wiki+Book | Token | glm-large-blank.tar.bz2 | model_blocklm_large.sh |
GLM-Large-Chinese | 335M | Chinese | WuDaoCorpora | Token+Sent+Doc | glm-large-chinese.tar.bz2 | model_blocklm_large_chinese.sh |
GLM-Doc | 335M | English | Wiki+Book | Token+Doc | glm-large-generation.tar.bz2 | model_blocklm_large_generation.sh |
GLM-410M | 410M | English | Wiki+Book | Token+Doc | glm-1.25-generation.tar.bz2 | model_blocklm_1.25_generation.sh |
GLM-515M | 515M | English | Wiki+Book | Token+Doc | glm-1.5-generation.tar.bz2 | model_blocklm_1.5_generation.sh |
GLM-RoBERTa | 335M | English | RoBERTa | Token | glm-roberta-large-blank.tar.bz2 | model_blocklm_roberta_large.sh |
GLM-2B | 2B | English | Pile | Token+Sent+Doc | glm-2b.tar.bz2 | model_blocklm_2B.sh |
GLM-10B | 10B | English | Pile | Token+Sent+Doc | Download | model_blocklm_10B.sh |
GLM-10B-Chinese | 10B | Chinese | WuDaoCorpora | Token+Sent+Doc | Download | model_blocklm_10B_chinese.sh |
Unzip the downloaded file into a local folder and set CHECKPOINT_PATH
in the corresponding scripts to the folder path.
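For example, a minimal extraction sketch using Python's tarfile module (the archive name comes from the table above; the target folder is a placeholder):
import tarfile

# Extract a downloaded checkpoint archive (GLM-Large as an example) into a
# local folder, then point CHECKPOINT_PATH in the corresponding script at
# that folder.
with tarfile.open("glm-large-blank.tar.bz2", "r:bz2") as archive:
    archive.extractall("checkpoints")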
Results
SuperGLUE
dev set, single model, single-task finetuning
Model | COPA | WSC | RTE | WiC | CB | MultiRC | BoolQ | ReCoRD |
---|---|---|---|---|---|---|---|---|
GLM-10B | 98.0 | 95.2 | 93.1 | 75.7 | 98.7/98.2 | 88.1/63.3 | 88.7 | 94.4/94.0 |
DeBERTa-XXLarge-v2 | 97.0 | - | 93.5 | - | - | 87.8/63.6 | 88.3 | 94.1/93.7 |
Seq2Seq
CNN/Daily Mail (test set, no additional data used)
Model | ROUGE-1 | ROUGE-2 | ROUGE-L |
---|---|---|---|
GLM-10B | 44.7 | 21.4 | 41.4 |
T5-11B | 43.5 | 21.6 | 40.7 |
PEGASUS-Large | 44.2 | 21.5 | 41.4 |
BART-Large | 44.2 | 21.3 | 40.9 |
XSum (test set, no additional data used)
Model | ROUGE-1 | ROUGE-2 | ROUGE-L |
---|---|---|---|
GLM-10B | 48.9 | 25.7 | 40.4 |
PEGASUS-Large | 47.2 | 24.6 | 39.3 |
BART-Large | 45.1 | 22.3 | 37.3 |
Language Modeling
test set, zero-shot
Model | LAMBADA (accuracy) | Wikitext103 (perplexity) |
---|---|---|
GLM-10B (bi) | 72.35 | 11.33 |
GLM-10B (uni) | 67.18 | 12.22 |
GPT-2 | 52.66 | 17.48 |
Megatron-LM (8.3B) | 66.51 | 10.81 |
Turing-NLG | 67.98 | 10.21 |
Get Started
Hugging Face Hub
You can access GLM models via the Hugging Face Hub. Please install transformers>=4.23.1 and find all the available models here.
Generation
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained("THUDM/glm-10b", trust_remote_code=True)
model = AutoModelForSeq2SeqLM.from_pretrained("THUDM/glm-10b", trust_remote_code=True)
model = model.half().cuda()
model.eval()
# Inference
inputs = tokenizer("Ng is an adjunct professor at [MASK] (formerly associate professor and Director of its Stanford AI Lab or SAIL ). Also a pioneer in online education, Ng co-founded Coursera and deeplearning.ai.", return_tensors="pt")
inputs = tokenizer.build_inputs_for_generation(inputs, max_gen_length=512)
inputs = inputs.to('cuda')
outputs = model.generate(**inputs, max_length=512, eos_token_id=tokenizer.eop_token_id)
print(tokenizer.decode(outputs[0].tolist()))
# Training
inputs = tokenizer(
["Tsinghua University is located in [MASK].", "One minus one equals zero, is it correct? Answer: [MASK]"],
return_tensors="pt", padding=True)
inputs = tokenizer.build_inputs_for_generation(inputs, targets=["Beijing", "No"], max_gen_length=8, padding=False)
inputs = inputs.to('cuda')
outputs = model(**inputs)
loss = outputs.loss
logits = outputs.logits
Classification
from transformers import AutoTokenizer, AutoModelForMultipleChoice
tokenizer = AutoTokenizer.from_pretrained("THUDM/glm-10b", trust_remote_code=True)
model = AutoModelForMultipleChoice.from_pretrained("THUDM/glm-10b", trust_remote_code=True)
model = model.half().cuda()
model.eval()
inputs = tokenizer(["Tsinghua University is located in [MASK].",
"One minus one equals zero, is it correct? Answer: [MASK]"], return_tensors="pt", padding=True)
choices = [["Beijing", "Shanghai"], ["Yes", "No"]]
inputs = tokenizer.build_inputs_for_multiple_choice(inputs, choices)
inputs = inputs.to('cuda')
outputs = model(**inputs)
logits = outputs.logits
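The predicted answer for each input can then be read off as the highest-scoring choice, e.g. logits.argmax(dim=-1).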
You can also convert the finetuned checkpoints with scripts/convert_glm_checkpoint_to_transformers.py.
Docker Image
We provide two Docker images, based on CUDA 10.2 and CUDA 11.2. You can pull the pre-built images from Docker Hub and run them with Docker v19.03+:
docker run --gpus all --rm -it --ipc=host zxdu20/glm-cuda102
or replace glm-cuda102 with glm-cuda112.
You can also modify docker/cuda102.dockerfile according to your requirements and build the image yourself:
docker build -f cuda102.dockerfile . -t glm-cuda102
Manual Installation
Please first install PyTorch (we use 1.7.0) and apex, and then install the other dependencies with pip install -r requirements.txt
Clone this repo
git clone https://github.com/THUDM/GLM
cd GLM
Model Parallelism
If you encounter a CUDA out of memory error, which means your GPU memory is limited, you can use model parallelism to split the parameters across multiple GPUs. Take two-way model parallelism as an example. First run change_mp.py to split the checkpoint:
python change_mp.py path_to_the_checkpoint 2
Then update the checkpoint path in the model config file (such as config_tasks/model_blocklm_10B.sh) and change MP_SIZE in the script (such as scripts/ds_finetune_superglue.sh) to 2.
Usage
We provide scripts for finetuning GLM on some downstream tasks.
Left-to-Right Generation / Blank Filling (Interactive)
- Change CHECKPOINT_PATH to your local path. Run the following script
bash scripts/generate_block.sh \
config_tasks/model_blocklm_10B_chinese.sh
Some models (GLM-2B, GLM-10B, and GLM-10B-Chinese) use three different mask tokens: [MASK] for short blank filling, [sMASK] for sentence filling, and [gMASK] for left-to-right generation.
Examples
[MASK] (Entity Prediction):
Example1
Context: Ng is an adjunct professor at [MASK] (formerly associate professor and Director of its Stanford AI Lab or SAIL ). Also a pioneer in online education, Ng co-founded Coursera and deeplearning.ai.
GLM: the stanford university
Example2 (Chinese, translated)
Context: The Arch of Triumph stands beside the old castle in Milan, Italy. It was built in 1807 to commemorate [MASK]; the arch is 25 meters high, and on its top stand bronze statues of two warriors and an ancient war chariot.
GLM: Napoleon's army capturing the city of Milan
[sMASK] (Sentence Prediction):
Example3
Context: There have been various types of pretraining architectures including autoencoding models (e.g., BERT), autoregressive models (e.g., GPT), and encoder-decoder models (e.g., T5). [sMASK] We propose a General Language Model ( GLM) based on autoregressive blank infilling to address this challenge. GLM improves blank filling pretraining by adding 2D positional encodings and allowing an arbitrary order to predict spans, which results in performance gains over BERT and T5 on NLU tasks. Meanwhile, GLM can be pretrained for different types of tasks by varying the number and lengths of blanks. On a wide range of tasks across NLU, conditional and unconditional generation, GLM outperforms BERT, T5, and GPT given the same model sizes and data, and achieves the best performance from a single pretrained model with 1.25ร parameters of BERT Large, demonstrating its generalizability to different downstream tasks.
GLM: However, there is a growing need to develop a single pretraining model that is not only good at natural language understanding (NLU) or dialog generation/generation (dialog), but is also able to predict other tasks such as sentiment analysis, conditional generation, or machine translation (MT).
Example4 (Chinese, translated)
Context: The Industrial Internet is a new type of infrastructure, application mode, and industrial ecosystem formed by the deep integration of new-generation information and communication technology with the industrial economy. By comprehensively connecting people, machines, things, and systems, it builds a brand-new manufacturing and service system covering the entire industrial chain and the entire value chain, provides a path to the digital, networked, and intelligent development of industry, and is an important cornerstone of the fourth industrial revolution. [sMASK] With the network as its foundation, platforms as its hub, data as its key element, and security as its safeguard, it is both the infrastructure for the digital, networked, and intelligent transformation of industry and an application mode for the deep integration of the Internet, big data, and artificial intelligence with the real economy; it is also a new business form and a new industry that will reshape enterprises, supply chains, and industrial chains. At present, integrated applications of the Industrial Internet are expanding into key sectors of the national economy, forming six new modes: platform-based design, intelligent manufacturing, networked collaboration, personalized customization, service-oriented extension, and digital management. Its empowering effects keep emerging, strongly promoting higher-quality, more efficient, lower-cost, greener, and safer development of the real economy.
GLM: The Industrial Internet is a major transformation of manufacturing technology, management, and business models; it is an important vehicle for promoting the deep integration of the Internet, big data, artificial intelligence, and the real economy, and an important foundation for building a manufacturing powerhouse and a network powerhouse.
[gMASK] (Long Text Generation):
Example5 (Chinese, translated)
Context: Question: In winter, which Chinese city is the best for escaping the cold? Question description: Can you recommend some cities in China that are suitable for escaping the cold in winter? Answering user: travel enthusiast. Answer: [gMASK]
GLM: Sanya, Hainan. Winter is Sanya's peak season, because in winter most parts of the country are hazy while Sanya is not, so Sanya is the most comfortable place to be in winter. Having grown up in the north, I feel Sanya is simply a paradise: travel to Haikou and Sanya in winter and enjoy the winter, the sunshine, and the beach. Haikou, however, is quite dry, so drink plenty of water. The most popular thing to do in Sanya in winter is to sunbathe, by the sea and on the beach; it makes you feel completely set free. Sanya also has a special activity, surfing, and surfing on Sanya's beaches is a very exciting experience. Haikou: the winter sunshine in Haikou is very warm, and winter is also the peak tourist season in Hainan. The best thing to do in Haikou in winter is to visit Hainan's tropical wildlife park, home to countless tropical animals that you can get close to; it is also Hainan's natural oxygen bar. You can also enjoy Haikou's beautiful sea views at Guanlanhu Park. Guiyang: winter in Guizhou is also quite mild, and Guiyang is one of the good cities for escaping the winter cold. When visiting Guiyang in winter, be sure to go to Qianling Mountain, a temple site in Guizhou with thriving incense offerings; visiting the temple in winter is a great experience. Besides Qianling Mountain, Guiyang also has Huaxi Park to visit in winter, the best choice among the local parks. Qingdao: winter is the most comfortable time in Qingdao; it has many bathing beaches, and soaking in a hot spring by the sea in winter and then basking in the sun is very pleasant. Qingdao also has beaches, and sunbathing on the beach in winter, looking at the sea, and playing beach games is a delight.
You can also add multiple [MASK] and [sMASK] in a single example. The model will fill the blanks one by one from left to right. The answer to each blank always begins with a special <|startofpiece|> token.
Examples
Example1
Context: There have been various types of pretraining architectures including autoencoding models (e.g., BERT), autoregressive models (e.g., GPT), and [MASK] (e.g., T5). [sMASK] We propose a General Language Model ( GLM) based on autoregressive blank infilling to address this challenge. GLM improves blank filling pretraining by adding 2D positional encodings and allowing an arbitrary order to predict spans, which results in performance gains over [MASK] on NLU tasks. Meanwhile, GLM can be pretrained for different types of tasks by varying the number and lengths of blanks. On a wide range of tasks across NLU, conditional and [MASK], GLM outperforms BERT, T5, and GPT given the same model sizes and data, and achieves the best performance from a single pretrained model with 1.25ร parameters of BERT Large , demonstrating its generalizability to different downstream tasks.
GLM: <|startofpiece|> blank filling models<|startofpiece|> However, most of them cannot easily transfer to other downstream tasks due to the different characteristics of these tasks.<|startofpiece|> other pretrained models<|startofpiece|> unconditional reading, and semantic role labeling tasks
Example2 (Chinese, translated)
Context: The Industrial Internet is a new type of infrastructure, application mode, and industrial ecosystem formed by the deep integration of new-generation [MASK] with [MASK]. By comprehensively connecting people, machines, things, and systems, it builds a brand-new manufacturing and service system covering the entire industrial chain and the entire value chain, provides a path to the digital, networked, and intelligent development of industry, and is an important cornerstone of the fourth industrial revolution. [sMASK] With the network as its foundation, platforms as its hub, data as its key element, and security as its safeguard, it is both the infrastructure for the digital, networked, and intelligent transformation of industry and an application mode for the deep integration of the Internet, big data, and artificial intelligence with the real economy; it is also a new business form and a new industry that will reshape enterprises, supply chains, and industrial chains. At present, integrated applications of the Industrial Internet are expanding into key sectors of the national economy, forming six new modes: [MASK], intelligent manufacturing, [MASK], personalized customization, service-oriented extension, and digital management. Its empowering effects keep emerging, strongly promoting higher-quality, more efficient, lower-cost, greener, and safer development of the real economy.
GLM: <|startofpiece|> information technology (ICT)<|startofpiece|> the industrial economy (II2O)<|startofpiece|> China's Industrial Internet is an internet oriented to all fields, all processes, and all systems of industry, characterized by the integration of multiple industries and multiple domains.<|startofpiece|> networked collaboration<|startofpiece|> platform enterprises
SuperGLUE
- Download the SuperGLUE data and check the experiment setup in scripts/ds_finetune_superglue.sh. Note that DATA_ROOT, CHECKPOINT_PATH, SAVE_PATH need to be changed to your local path. You may also change the batch-size and nproc_per_node according to your available hardware.
- Run the following script (use the COPA dataset as an example)
bash scripts/ds_finetune_superglue.sh \
config_tasks/model_blocklm_10B.sh \
config_tasks/task_copa.sh
- We also implement P-Tuning in our code. Run the following script to integrate P-Tuning:
bash scripts/ds_finetune_superglue_prompt.sh \
config_tasks/model_blocklm_10B.sh \
config_tasks/task_copa.sh
- To apply GLM to a new NLU dataset with cloze-filling finetuning, implement a DataProcessor in tasks/superglue/dataset.py for data loading and add a PVP in tasks/superglue/pvp.py for the cloze question. More details can be found here; a rough sketch of the two pieces follows this list.
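The following is a hypothetical sketch of what such a pair might look like, loosely following PET-style conventions. The class names, the toy data format, and the exact base-class interfaces are assumptions; consult the existing processors and PVPs under tasks/superglue/ for the real signatures.
import json


class MyTaskProcessor:
    # In the repository this would subclass the DataProcessor base class in
    # tasks/superglue/dataset.py; here it only illustrates the expected role:
    # loading raw examples and declaring the label set.
    def get_labels(self):
        return ["entailment", "not_entailment"]

    def _read_split(self, data_dir, split):
        # Hypothetical format: one JSON object per line with
        # "premise", "hypothesis", and "label" fields.
        with open(f"{data_dir}/{split}.jsonl") as f:
            return [json.loads(line) for line in f]

    def get_train_examples(self, data_dir):
        return self._read_split(data_dir, "train")

    def get_dev_examples(self, data_dir):
        return self._read_split(data_dir, "dev")


class MyTaskPVP:
    # In the repository this would subclass the PVP base class in
    # tasks/superglue/pvp.py; it turns an example into a cloze question and
    # maps each label to the word that should fill the blank.
    VERBALIZER = {"entailment": "Yes", "not_entailment": "No"}

    def get_parts(self, example):
        # [MASK] marks the blank that GLM fills with a verbalized label.
        return ['"{}"? [MASK], "{}"'.format(example["hypothesis"], example["premise"])]

    def verbalize(self, label):
        return [self.VERBALIZER[label]]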
Seq2Seq
- Download the Gigaword, CNN/Daily Mail, or XSum dataset and check the experiment setup in scripts/ds_finetune_seq2seq.sh. Change DATA_ROOT, CHECKPOINT_PATH, SAVE_PATH to your local path.
- Run the following script (use the CNN/Daily Mail dataset as an example)
bash scripts/ds_finetune_seq2seq.sh \
config_tasks/model_blocklm_10B.sh \
config_tasks/seq_cnndm_org.sh
- The summaries are written into ./runs/experiment_name/test.jsonl.hyps. The references are written into test.jsonl.refs in the same directory. For calculating ROUGE, install file2rouge and download Stanford CoreNLP from here. Run the following script
bash scripts/evaluate_seq2seq.sh \
./runs/experiment_name/test.jsonl.hyps ./runs/experiment_name/test.jsonl.refs
Train with your own data
Process your seq2seq data into {split}.source and {split}.target, with each line being the context or the target of a sample, and split being train, val, and test.
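A minimal data-preparation sketch (the sample texts and the output location are placeholders):
samples = {
    "train": [("Context of the first training sample.", "Target of the first training sample.")],
    "val": [("Context of a validation sample.", "Target of a validation sample.")],
    "test": [("Context of a test sample.", "Target of a test sample.")],
}

# Write one sample per line: the context goes to {split}.source and the
# corresponding target goes to {split}.target.
for split, pairs in samples.items():
    with open(f"{split}.source", "w") as src, open(f"{split}.target", "w") as tgt:
        for context, target in pairs:
            src.write(context.replace("\n", " ") + "\n")
            tgt.write(target.replace("\n", " ") + "\n")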
Run the following script
bash scripts/ds_finetune_seq2seq.sh \
config_tasks/model_blocklm_10B.sh \
config_tasks/seq_customization.sh
You can specify the hyperparameters in config_tasks/seq_customization.sh and config_tasks/config_blocklm_10B_cnndm.json.
Multiple Choice (Zero-shot)
bash scripts/evaluate_multichoice.sh config_tasks/model_blocklm_10B.sh
Note that CHECKPOINT_PATH
and DATA_PATH
need to be changed to your local path.
The format of each line of the data file should be
{"inputs_pretokenized": "Context and question here", "choices_pretokenized": ["Choice 1", "Choice 2", "Choice 3"], "label": int}
Language Modeling
LAMBADA Cloze Accuracy
- Download the LAMBADA data and change DATA_ROOT, CHECKPOINT_PATH in scripts/evaluate_lm.sh
- Run the following script
bash scripts/evaluate_lm.sh \
config_tasks/model_blocklm_large_generation.sh \
config_tasks/zero_lambada.sh
LM Perplexity
- Download our test set of wikibook or the Wikitext103 dataset and change DATA_ROOT, CHECKPOINT_PATH in scripts/evaluate_lm.sh
- Run the following script
bash scripts/evaluate_lm.sh \
config_tasks/model_blocklm_large_generation.sh \
config_tasks/zero_wikitext.sh
Text Infilling
- Download the Yahoo dataset and check the experiment setup in scripts/finetune_blank.sh. Change DATA_ROOT, CHECKPOINT_PATH, SAVE_PATH to your local path.
- Run the following script
bash scripts/finetune_blank.sh \
config_tasks/model_blocklm_large.sh \
config_tasks/seq_blank.sh
Pretrain
Run the following script to pre-train the GLM-Large model
bash scripts/ds_pretrain_nvidia.sh config/ds_block_large.sh
The script scripts/ds_pretrain_nvidia.sh launches the training program with DeepSpeed. You should change NUM_WORKERS and NUM_GPUS_PER_WORKER to the number of workers and the number of GPUs per worker. Also change HOST_FILE_PATH to the path of an OpenMPI-style hostfile. More details about the DeepSpeed launcher can be found here.
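Each line of the hostfile typically names one worker and its available GPU slots, for example worker-1 slots=8 (the hostname here is a placeholder).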
The file config/ds_block_large.sh defines the hyperparameters for pretraining. Most of the arguments are fairly self-explanatory. Specifically, --train-data can be multiple keywords defined in NAMED_CORPORA in data_utils/corpora.py. The hyperparameters of the optimizer are defined in the corresponding JSON file under config. The semantics of the JSON file can be found here.
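As an illustration only, a file of that kind can be generated from Python. The keys below are standard DeepSpeed options, but the values and the output file name are placeholders, and the repository's actual JSON files under config may differ:
import json

# A minimal, illustrative DeepSpeed-style configuration (not the repository's
# actual settings): batch size, Adam optimizer hyperparameters, and fp16.
ds_config = {
    "train_micro_batch_size_per_gpu": 8,
    "gradient_accumulation_steps": 1,
    "optimizer": {
        "type": "Adam",
        "params": {"lr": 1e-4, "weight_decay": 0.1, "betas": [0.9, 0.95], "eps": 1e-8},
    },
    "fp16": {"enabled": True},
}

with open("my_ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)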
Citation
Part of the code is based on Megatron-LM and PET.
Please cite our paper if you find this code useful for your research:
@inproceedings{DBLP:conf/acl/DuQLDQY022,
author = {Zhengxiao Du and
Yujie Qian and
Xiao Liu and
Ming Ding and
Jiezhong Qiu and
Zhilin Yang and
Jie Tang},
title = {{GLM:} General Language Model Pretraining with Autoregressive Blank Infilling},
booktitle = {Proceedings of the 60th Annual Meeting of the Association for Computational
Linguistics (Volume 1: Long Papers), {ACL} 2022, Dublin, Ireland,
May 22-27, 2022},
pages = {320--335},
publisher = {Association for Computational Linguistics},
year = {2022},
}