• This repository has been archived on 18/Sep/2023
  • Stars
    star
    308
  • Rank 135,712 (Top 3 %)
  • Language
    Python
  • License
    Apache License 2.0
  • Created over 3 years ago
  • Updated about 1 year ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

Repository containing code for "How to Train BERT with an Academic Budget" paper

Training BERT with Compute/Time (Academic) Budget

This repository contains scripts for pre-training and finetuning BERT-like models with limited time and compute budget. The code is based on the work presented in the following paper:

Peter Izsak, Moshe Berchansky, Omer Levy, How to Train BERT with an Academic Budget (EMNLP 2021).

Installation

The pre-training and finetuning scripts are based on Deepspeed and HuggingFace Transformers libraries.

Preliminary Installation

We recommend creating a virtual environment with python 3.6+, PyTorch and apex.

Installation Requirements

pip install -r requirements.txt

We suggest running Deepspeed's utility ds_report and verify Deepspeed components can be compiled (JIT).

Dataset

The dataset directory includes scripts to pre-process the datasets we used in our experiments (Wikipedia, Bookcorpus). See dedicated README for full details.

Pretraining

Pretraining script: run_pretraining.py

For all possible pretraining arguments see: python run_pretraining.py -h

We highly suggest reviewing the various training features we provide within the library.

Example for training with the best configuration presented in our paper (24-layers/1024H/time-based learning rate schedule/fp16):
deepspeed run_pretraining.py \
  --model_type bert-mlm --tokenizer_name bert-large-uncased \
  --hidden_act gelu \
  --hidden_size 1024 \
  --num_hidden_layers 24 \
  --num_attention_heads 16 \
  --intermediate_size 4096 \
  --hidden_dropout_prob 0.1 \
  --attention_probs_dropout_prob 0.1 \
  --encoder_ln_mode pre-ln \
  --lr 1e-3 \
  --train_batch_size 4096 \
  --train_micro_batch_size_per_gpu 32 \
  --lr_schedule time \
  --curve linear \
  --warmup_proportion 0.06 \
  --gradient_clipping 0.0 \
  --optimizer_type adamw \
  --weight_decay 0.01 \
  --adam_beta1 0.9 \
  --adam_beta2 0.98 \
  --adam_eps 1e-6 \
  --total_training_time 24.0 \
  --early_exit_time_marker 24.0 \
  --dataset_path <dataset path> \
  --output_dir /tmp/training-out \
  --print_steps 100 \
  --num_epochs_between_checkpoints 10000 \
  --job_name pretraining_experiment \
  --project_name budget-bert-pretraining \
  --validation_epochs 3 \
  --validation_epochs_begin 1 \
  --validation_epochs_end 1 \
  --validation_begin_proportion 0.05 \
  --validation_end_proportion 0.01 \
  --validation_micro_batch 16 \
  --deepspeed \
  --data_loader_type dist \
  --do_validation \
  --use_early_stopping \
  --early_stop_time 180 \
  --early_stop_eval_loss 6 \
  --seed 42 \
  --fp16

Time-based Training

Pretraining can be limited to a time-based value by defining --total_training_time=24.0 (24 hours for example).

Time-based Learning Rate Scheduling

The learning rate can be scheduled to change according to the configured total training time. The argument --total_training_time controls the total time assigned for the trainer to run, and must be specified in order to use time-based learning rate scheduling.

Time-based Learning rate schedule

To select time-based learning rate scheduling, define --lr_schedule time, and define a shape for for the annealing curve (--curve=linear for example, as seen in the figure). The warmup phase of the learning rate is define by specifying a proportion (--warmup_proportion) which accounts for the time-budget proportion available in the training session (as defined by --total_training_time). For example, for a 24 hour training session, warmup_proportion=0.1 would account for 10% of 24 hours, that is, 2.4 hours (or 144 minutes) to reach peak learning rate. The learning rate will then be scheduled to reach 0 at the end of the time budget. We refer to the provided figure for an example.

Checkpoints and Finetune Checkpoints

There are 2 types of checkpoints that can be enabled:

  • Training checkpoint - saves model weights, optimizer state and training args. Defined by --num_epochs_between_checkpoints.
  • Finetuning checkpoint - saves model weights and configuration to be used for finetuning later on. Defined by --finetune_time_markers.

finetune_time_markers can be assigned multiple points in the training time-budget by providing a list of time markers of the overall training progress. For example --finetune_time_markers=0.5 will save a finetuning checkpoint when reaching 50% of training time budget. For multiple finetuning checkpoints, use commas without space 0.5,0.6,0.9.

Validation Scheduling

Enable validation while pre-training with --do_validation

Control the number of epochs between validation runs with --validation_epochs=<num>

To control the amount of validation runs in the beginning and end (running more that validation_epochs) use validation_begin_proportion and validation_end_proportion to specify the proportion of time and, validation_epochs_begin and validation_epochs_end to control the custom values accordingly.

Mixed Precision Training

Mixed precision is supported by adding --fp16. Use --fp16_backend=ds to use Deepspeed's mixed precision backend and --fp16_backend=apex for apex (--fp16_opt controls optimization level).

Finetuning

Use run_glue.py to run finetuning for a saved checkpoint on GLUE tasks.

The finetuning script is identical to the one provided by Huggingface with the addition of our model.

For all possible pretraining arguments see: python run_glue.py -h

Example for finetuning on MRPC:
python run_glue.py \
  --model_name_or_path <path to model> \
  --task_name MRPC \
  --max_seq_length 128 \
  --output_dir /tmp/finetuning \
  --overwrite_output_dir \
  --do_train --do_eval \
  --evaluation_strategy steps \
  --per_device_train_batch_size 32 --gradient_accumulation_steps 1 \
  --per_device_eval_batch_size 32 \
  --learning_rate 5e-5 \
  --weight_decay 0.01 \
  --eval_steps 50 --evaluation_strategy steps \
  --max_grad_norm 1.0 \
  --num_train_epochs 5 \
  --lr_scheduler_type polynomial \
  --warmup_steps 50

Generating Pretraining Commands

We provide a useful script for generating multiple (or single) pretraining commands by using python generate_training_commands.py.

python generate_training_commands.py -h

	--param_file PARAM_FILE Hyperparameter and configuration yaml
  	--job_name JOB_NAME   job name
 	--init_cmd INIT_CMD   initialization command (deepspeed or python directly)

A parameter yaml must be defined with 2 main keys: hyperparameters with argument values defined as a list of possible values, and default_parameters as default values. Each generated command will be a possible combination of the various arguments specified in the hyperparameters section.

Example:

hyperparameters:
  param1: [val1, val2]
  param2: [val1, val2]

default_parameters:
  param3: 0.0

will result in:

deepspeed run_pretraining.py --param1=val1 --param2=val1 --param3=0.0
deepspeed run_pretraining.py --param1=val1 --param2=val2 --param3=0.0
deepspeed run_pretraining.py --param1=val2 --param2=val1 --param3=0.0
deepspeed run_pretraining.py --param1=val2 --param2=val2 --param3=0.0

Citation

If you find this paper or this code useful, please cite this paper:

@inproceedings{izsak-etal-2021-train,
    title = "How to Train {BERT} with an Academic Budget",
    author = "Izsak, Peter  and
      Berchansky, Moshe  and
      Levy, Omer",
    booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2021",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.emnlp-main.831",
}

More Repositories

1

distiller

Neural Network Distiller by Intel AI Lab: a Python package for neural network compression research. https://intellabs.github.io/distiller
Jupyter Notebook
4,332
star
2

nlp-architect

A model library for exploring state-of-the-art deep learning topologies and techniques for optimizing Natural Language Processing neural networks
Python
2,936
star
3

coach

Reinforcement Learning Coach by Intel AI Lab enables easy experimentation with state of the art Reinforcement Learning algorithms
Python
2,321
star
4

control-flag

A system to flag anomalous source code expressions by learning typical expressions from training data
C++
1,241
star
5

fastRAG

Efficient Retrieval Augmentation and Generation Framework
Python
1,194
star
6

flrc

Haskell Research Compiler
Standard ML
814
star
7

RiverTrail

An API for data parallelism in JavaScript
JavaScript
748
star
8

kAFL

A fuzzer for full VM kernel/driver targets
Makefile
636
star
9

bayesian-torch

A library for Bayesian neural network layers and uncertainty estimation in Deep Learning extending the core of PyTorch
Python
503
star
10

ParallelAccelerator.jl

The ParallelAccelerator package, part of the High Performance Scripting project at Intel Labs
Julia
294
star
11

RAGFoundry

Framework for enhancing LLMs for RAG tasks using fine-tuning.
Python
289
star
12

SkimCaffe

Caffe for Sparse Convolutional Neural Network
C++
238
star
13

pWord2Vec

Parallelizing word2vec in shared and distributed memory
C++
191
star
14

causality-lab

Causal discovery algorithms and tools for implementing new ones
Jupyter Notebook
167
star
15

matsciml

Open MatSci ML Toolkit is a framework for prototyping and scaling out deep learning models for materials discovery supporting widely used materials science datasets, and built on top of PyTorch Lightning, the Deep Graph Library, and PyTorch Geometric.
Python
143
star
16

riscv-vector

Vector Acceleration IP core for RISC-V*
Scala
136
star
17

Model-Compression-Research-Package

A library for researching neural networks compression and acceleration methods.
Python
134
star
18

IntelNeuromorphicDNSChallenge

Intel Neuromorphic DNS Challenge
Jupyter Notebook
126
star
19

MMPano

Official implementation of L-MAGIC
Python
123
star
20

rnnlm

Recurrent Neural Network Language Modeling (RNNLM) Toolkit
C++
121
star
21

HPAT.jl

High Performance Analytics Toolkit (HPAT) is a Julia-based framework for big data analytics on clusters.
Julia
120
star
22

FP8-Emulation-Toolkit

PyTorch extension for emulating FP8 data formats on standard FP32 Xeon/GPU hardware.
Python
90
star
23

ScalableVectorSearch

C++
88
star
24

VL-InterpreT

Visual Language Transformer Interpreter - An interactive visualization tool for interpreting vision-language transformers
Python
84
star
25

vdms

VDMS: Your Favorite Visual Data Management System
C++
82
star
26

SpMP

sparse matrix pre-processing library
C++
81
star
27

SLIDE_opt_ia

C++
74
star
28

CLNeRF

Python
63
star
29

baa-ngp

This repository contains the official Implementation for "BAA-NGP: Bundle-Adjusting Accelerated Neural Graphics Primitives".
Python
56
star
30

autonomousmavs

Framework for Autonomous Navigation of Micro Aerial Vehicles
C++
56
star
31

multimodal_cognitive_ai

research work on multimodal cognitive ai
Python
56
star
32

Latte.jl

A high-performance DSL for deep neural networks in Julia
Julia
53
star
33

AVUC

Code to accompany the paper 'Improving model calibration with accuracy versus uncertainty optimization'.
Python
51
star
34

GraVi-T

Graph learning framework for long-term video understanding
Python
49
star
35

PreSiFuzz

Pre-Silicon Hardware Fuzzing Toolkit
Rust
47
star
36

pmgd

Persistent Memory Graph Database
C++
43
star
37

TSAD-Evaluator

Intel Labs open source repository for time series anomaly detection evaluator
C++
41
star
38

Open-Omics-Acceleration-Framework

Intel lab's open sourced data science framework for accelerating digital biology
Jupyter Notebook
36
star
39

Auto-Steer

Auto-Steer
Python
36
star
40

FloorSet

Jupyter Notebook
34
star
41

SAR

Python
34
star
42

kafl.fuzzer

kAFL Fuzzer
Python
32
star
43

CompilerTools.jl

The CompilerTools package, part of the High Performance Scripting project at Intel Labs
Julia
30
star
44

TinyGarble2.0

C++
29
star
45

t2sp

Productive and portable performance programming across spatial architectures (FPGAs, etc.) and vector architectures (GPUs, etc.)
C++
29
star
46

DyNAS-T

Dynamic Neural Architecture Search Toolkit
Jupyter Notebook
28
star
47

ParallelJavaScript

A collection of example workloads for Parallel JavaScript
HTML
26
star
48

kafl.targets

Target components for kAFL/Nyx Fuzzer
C
25
star
49

continuallearning

Python
25
star
50

iHRC

Intel Heterogeneous Research Compiler (iHRC)
C++
25
star
51

scenario_execution

Scenario Execution for Robotics
Python
25
star
52

flrc-lib

Pillar compiler, Pillar runtime, garbage collector.
C++
23
star
53

lvlm-interpret

Python
23
star
54

iACT

C++
22
star
55

OSCAR

Object Sensing and Cognition for Adversarial Robustness
Jupyter Notebook
20
star
56

MICSAS

MISIM: A Neural Code Semantics Similarity System Using the Context-Aware Semantics Structure
Python
19
star
57

mat2qubit

Python
19
star
58

csg

IV 2020 "CSG: Critical Scenario Generation from Real Traffic Accidents"
Python
18
star
59

Sparso

Julia package for accelerating sparse matrix applications.
Julia
18
star
60

open-omics-alphafold

Python
17
star
61

MART

Modular Adversarial Robustness Toolkit
Python
16
star
62

Trans-Omics-Acceleration-Library

HTML
15
star
63

Hardware-Aware-Automated-Machine-Learning

Jupyter Notebook
15
star
64

kafl.linux

Linux kernel branches for confidential compute research
15
star
65

c3-simulator

C3-Simulator is a Simics-based functional simulator for the X86 C3 processor, including library and kernel support for pointer and data encryption, stack unwinding support for C++ exception handling, debugger enabling, and scripting for running tests.
C++
14
star
66

VectorSearchDatasets

Python
11
star
67

flrc-benchmarks

Benchmarks for use with IntelLabs/flrc.
Haskell
10
star
68

ais-benchmarks

A framework, based on python and numpy, for evaluation of sampling methods
Python
10
star
69

ALTO

A template-based implementation of the Adaptive Linearized Tensor Order (ALTO) format for storing and processing sparse tensors.
C++
10
star
70

hec-p-isa-tools

Intelโ€™s HERACLES accelerator introduces a new set of fundamental instructions, the Polynomial Instructions Set Architecture (P-ISA) that operates directly on polynomials requiring a completely new programming environment. This open-source project aims at developing the building blocks for a compiler toolchain for HERACLES.
Python
10
star
71

PyTorchALFI

Application Level Fault Injection for Pytorch
Python
9
star
72

RiverTrail-interactive

An interactive shell in your browser for writing and running River Trail programs
JavaScript
8
star
73

gma

Linux Client & Server Software to support Generic Multi-Access Network Virtualization
C++
8
star
74

dfm

DFM (Deep Feature Modeling) is an efficient and principled method for out-of-distribution detection, novelty and anomaly detection.
Python
7
star
75

SOI_FFT

Segment-of-interest low-communication FFT algorithm
C
7
star
76

vcl

DEPRECATED - No longer maintained. Updates are will be provided through the VDMS project
C++
6
star
77

DATSA

DATSA
C++
6
star
78

Hybrid-Quantum-Classical-Library

Hybrid Quantum-Classical Library (HQCL)
C++
6
star
79

spic

Semantic Preserving Image Compression
Python
6
star
80

generative-ai

Intel Generative Image Model Benchmark
Jupyter Notebook
6
star
81

Optimized-Implementation-of-Word-Movers-Distance

C++
6
star
82

token_elimination

Python
6
star
83

NeuroCounterfactuals

Jupyter Notebook
5
star
84

c3-glibc

C
5
star
85

PolarFly

Source code repository for paper being presented at Super Computing 22 Conference.
C++
5
star
86

aspect-extraction

Pattern Based Aspect Term Extraction
Python
5
star
87

networkgym

NetworkGym is a Simulation-aaS framework to support Network AI algorithm development by providing high-fidelity full-stack e2e network simulation in cloud and allowing AI developers to interact with the simulated network environment through open APIs.
C++
5
star
88

Latte.py

Python
5
star
89

HDFIT

HDFIT (Hardware Design Fault Injection Toolkit) Github documentation pages.
5
star
90

TME-MK-Fine-Grained-Encryption-Integrity

Makefile
5
star
91

EquiTriton

EquiTriton is a project that seeks to implement high-performance kernels for commonly used building blocks in equivariant neural networks, enabling compute efficient training and inference.
Python
4
star
92

Incremental-Neural-Videos-with-PyTorch

Incremental-Neural-Videos-with-PyTorch*
Python
4
star
93

kafl.qemu

4
star
94

simics-plus-rtl

This project contains the Chisel code for a CRC32 datapath alongside a skeleton PCI component in Simics DML which connects to the C++ conversion of the CRC32 datapath.
Scala
4
star
95

Chisel-cocotb-Examples

This project contains generic example hardware modules and their testbenches written in Chisel and cocotb to demonstrate an agile hardware development methodology.
Python
4
star
96

LogReplicationRocksDB

C++
4
star
97

emp-ot

C++
3
star
98

kafl.libxdc

C
3
star
99

kafl.actions

Github actions for KAFL
Python
3
star
100

emp-tool

C++
3
star