
Transformer language model (GPT-2) with sentencepiece tokenizer

Training a GPT-2 transformer language model on your own corpora with sentencepiece tokenization.

This repo contains a PyTorch implementation of GPT-2 that supports multi-GPU training. It also contains a TensorFlow implementation in lm/gpt_2_tf, but that one is no longer developed. The two share the same data preparation scripts. The TF training command is gpt-2-tf-train and requires TensorFlow 1.13. The documentation below covers the PyTorch version.

Installation

Python 3.6+ is required, with torch nightly or 1.6.0+. Working in a virtualenv is assumed below. Install the appropriate version of PyTorch first, and then:

pip install -r requirements.txt
python setup.py develop
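
For example, one compatible sequence on a CPU-only machine (the torch pin is only an illustration; pick the build for your platform from pytorch.org):

    pip install torch==1.6.0
    pip install -r requirements.txt
    python setup.py develop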

Usage

Instructions are below. See also test/test_shakespeare.sh for a complete pipeline demo on a small corpus (takes a minute on a CPU).

Prepare data for training

Corpus format: a directory with top-level train, valid and test folders. Each top-level folder may contain sub-folders. Inside them, there must be UTF-8 encoded text files with a .txt extension.
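
For illustration, a corpus in the expected layout could be created with a few lines of Python (paths and file names here are arbitrary examples):

    from pathlib import Path

    # Expected layout: train/valid/test at the top level, optional
    # sub-folders inside, UTF-8 .txt files at the leaves.
    root = Path("data/my-corpus")
    for split in ["train", "valid", "test"]:
        part = root / split / "part-1"  # sub-folders are allowed
        part.mkdir(parents=True, exist_ok=True)
        (part / "sample.txt").write_text("Some example text.\n", encoding="utf-8")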

The commands that train the sentencepiece model and encode the corpus support multiple corpora; in the examples below we assume they can be listed as data/corpora-*.

  1. Train the sentencepiece model (sp-text.txt can be removed after running). This can consume a large amount of memory; adjust the sentencepiece arguments as advised if needed (this is not supported by the sp-train command directly; see the sketch after this list):

    sp-train data/corpora-* sp-text.txt sp-model
    
  2. Encode corpora, producing numpy files:

    sp-encode data/corpora-* sp-model.model data/encoded
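
If sp-train runs out of memory, one possible workaround (not part of this repo's CLI, just a sketch using the sentencepiece Python API) is to train the model directly on sp-text.txt with memory-limiting options:

    import sentencepiece as spm

    # Hypothetical workaround: train on a random sample of sp-text.txt
    # to bound memory; vocab_size here is only an example value.
    spm.SentencePieceTrainer.train(
        input="sp-text.txt",
        model_prefix="sp-model",
        vocab_size=50000,
        input_sentence_size=10_000_000,  # use at most this many sentences
        shuffle_input_sentence=True,     # sample them randomly
    )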
    

Training

Example command:

gpt-2 run-root data/encoded sp-model.model

run-root will contain model checkpoints and json-lines logs, which can be plotted in a Jupyter notebook with json_log_plots.plot("run-root"); the X axis shows the number of tokens seen.
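
For example, in a Jupyter notebook cell:

    import json_log_plots

    # Plot losses from the json-lines logs in run-root;
    # the X axis is the number of tokens seen.
    json_log_plots.plot("run-root")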

Default hyperparameters correspond to the released "small" GPT-2 model.

When multiple GPUs are available, they will be used for training via torch.distributed.

If run-root already exists and the --clean flag is NOT passed, training is resumed. Note that all parameters still need to be specified, and the model parameters need to match.
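
For example, reusing the command above (the --clean behavior is inferred from the note above: without it an existing run-root is resumed, with it training starts over):

    gpt-2 run-root data/encoded sp-model.model          # resumes if run-root exists
    gpt-2 run-root data/encoded sp-model.model --clean  # starts from scratch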

Notes on training parameters:

  • --batch-size is per-GPU, so you don't need to re-tune it when changing the number of GPUs; just use the maximum that fits into memory.
  • --g-accum-gradients is the global number of gradient accumulations; it must be divisible by the number of GPUs. The effective global batch size is always batch_size * g_accum_gradients (see the worked example after this list).
  • --lr does not need to be changed when changing --batch-size, --g-accum-gradients, the number of GPUs, or --n-ctx: the loss is already scaled appropriately.
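
A worked example of how these parameters interact, with hypothetical values on a 4-GPU machine:

    n_gpus = 4
    batch_size = 2               # --batch-size, per GPU
    g_accum_gradients = 8        # --g-accum-gradients, global; 8 % n_gpus == 0
    per_gpu_accum = g_accum_gradients // n_gpus        # each GPU accumulates 2 steps
    effective_batch = batch_size * g_accum_gradients   # 2 * 8 = 16 sequences per step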

Inference

Example command:

gpt-2-gen run-root "Artificial intelligence"

run-root should contain model checkpoints; "Artificial intelligence" is the text prefix used as a starting point for generating tokens.

Notes on inference parameters:

  • --tokens-to-generate: number of tokens to generate; the default is 42.
  • --top-k: number of token candidates to generate for each position (beam width); the default is 8.
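
For example, to generate a longer sample with a wider beam (the values here are arbitrary):

    gpt-2-gen run-root "Artificial intelligence" --tokens-to-generate 100 --top-k 40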

License & credits

License is MIT.

The TensorFlow GPT-2 model is taken from https://github.com/openai/gpt-2/blob/master/src/model.py, and the TensorFlow GPT-2 training code is based on https://github.com/nshepperd/gpt-2/blob/finetuning/train.py.

The PyTorch port is based on the original OpenAI code.

The test Shakespeare corpus under tests/shakespeare is from http://shakespeare.mit.edu and is in the public domain.

See also the OpenAI GPT-2 paper and blog post.
