  • Language
    Python
  • License
    Apache License 2.0
  • Created over 6 years ago
  • Updated about 1 year ago

Tensorflow implementation of contextualized word representations from bi-directional language models

bilm-tf

Tensorflow implementation of the pretrained biLM used to compute ELMo representations from "Deep contextualized word representations".

This repository supports both training biLMs and using pre-trained models for prediction.

We also have a pytorch implementation available in AllenNLP.

You may also find it easier to use the version provided in Tensorflow Hub if you just want to make predictions.

Citation:

@inproceedings{Peters:2018,
  author={Peters, Matthew E. and  Neumann, Mark and Iyyer, Mohit and Gardner, Matt and Clark, Christopher and Lee, Kenton and Zettlemoyer, Luke},
  title={Deep contextualized word representations},
  booktitle={Proc. of NAACL},
  year={2018}
}

Installing

Install python version 3.5 or later, tensorflow version 1.2 and h5py:

pip install tensorflow-gpu==1.2 h5py
python setup.py install

Ensure the tests pass in your environment by running:

python -m unittest discover tests/

Installing with Docker

To run the image, you must use nvidia-docker, because this repository requires GPUs.

sudo nvidia-docker run -t allennlp/bilm-tf:training-gpu

Using pre-trained models

We have several different English language pre-trained biLMs available for use. Each model is specified with two separate files, a JSON formatted "options" file with hyperparameters and a hdf5 formatted file with the model weights. Links to the pre-trained models are available here.

There are three ways to integrate ELMo representations into a downstream task, depending on your use case.

  1. Compute representations on the fly from raw text using character input. This is the most general method and will handle any input text. It is also the most computationally expensive.
  2. Precompute and cache the context independent token representations, then compute context dependent representations using the biLSTMs for input data. This method is less computationally expensive than #1, but is only applicable with a fixed, prescribed vocabulary.
  3. Precompute the representations for your entire dataset and save to a file.

We have used all of these methods in the past for various use cases. #1 is necessary for evaluating at test time on unseen data (e.g. public SQuAD leaderboard). #2 is a good compromise for large datasets where the size of the file in #3 is unfeasible (SNLI, SQuAD). #3 is a good choice for smaller datasets or in cases where you'd like to use ELMo in other frameworks.

In all cases, the process roughly follows the same steps. First, create a Batcher (or TokenBatcher for #2) to translate tokenized strings to numpy arrays of character (or token) ids. Then, load the pretrained ELMo model (class BidirectionalLanguageModel). Finally, for methods #1 and #2, use weight_layers to compute the final ELMo representations. For #3, use BidirectionalLanguageModel to write all the intermediate layers to a file.

Shape conventions

Each tokenized sentence is a list of str, with a batch of sentences a list of tokenized sentences (List[List[str]]).

The Batcher packs these into a shape (n_sentences, max_sentence_length + 2, 50) numpy array of character ids, padding on the right with 0 ids for sentences shorter than the maximum length. The first and last tokens for each sentence are special begin and end of sentence ids added by the Batcher.

The input character id placeholder can be dimensioned (None, None, 50), with both the batch dimension (axis=0) and time dimension (axis=1) determined for each batch, up to the maximum batch size specified in the BidirectionalLanguageModel constructor.

After running inference with the batch, the returned biLM embeddings are a numpy array with shape (n_sentences, 3, max_sentence_length, 1024), after removing the special begin/end tokens.
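The shape conventions above can be checked with a few lines of arithmetic. The following is a minimal sketch that does not call bilm at all; the helper name expected_shapes and the constant names are hypothetical, and the numbers 50, 3 and 1024 are taken from the text above.

```python
from typing import List, Tuple

MAX_CHARS_PER_TOKEN = 50   # characters per token, as stated above
N_LAYERS = 3               # biLM layers returned per token
EMBEDDING_DIM = 1024       # size of each layer's vector


def expected_shapes(batch: List[List[str]]) -> Tuple[tuple, tuple]:
    """Return (character id shape, biLM embedding shape) for a batch."""
    n_sentences = len(batch)
    max_len = max(len(sentence) for sentence in batch)
    # +2 accounts for the begin/end of sentence ids added by the Batcher.
    char_ids_shape = (n_sentences, max_len + 2, MAX_CHARS_PER_TOKEN)
    # The special begin/end tokens are removed from the returned embeddings.
    embeddings_shape = (n_sentences, N_LAYERS, max_len, EMBEDDING_DIM)
    return char_ids_shape, embeddings_shape


batch = [["Pretrained", "biLMs", "compute", "representations"],
         ["Hello", "world", "."]]
char_ids_shape, embeddings_shape = expected_shapes(batch)
print(char_ids_shape)    # (2, 6, 50)
print(embeddings_shape)  # (2, 3, 4, 1024)
```

The shorter sentence is padded to the length of the longest one, so both sentences occupy 6 positions in the character id array and 4 positions in the output.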

Vocabulary file

The Batcher takes a vocabulary file as input for efficiency. This is a text file with one token per line, separated by newlines (\n). Each token in the vocabulary is cached once as the appropriate 50 character id sequence. Since the model is completely character based, tokens not in the vocabulary file are still handled appropriately at run time, at the cost of slightly slower processing. It is recommended to always include the special <S> and </S> tokens (case sensitive) in the vocabulary file.

ELMo with character input

See usage_character.py for a detailed usage example.

ELMo with pre-computed and cached context independent token representations

To speed up model inference with a fixed, specified vocabulary, it is possible to pre-compute the context independent token representations, write them to a file, and re-use them for inference. Note that we don't support falling back to character inputs for out-of-vocabulary words, so this should only be used when the biLM is used to compute embeddings for input with a fixed, defined vocabulary.

To use this option:

  1. First create a vocabulary file with all of the unique tokens in your dataset and add the special <S> and </S> tokens.
  2. Run dump_token_embeddings with the full model to write the token embeddings to a hdf5 file.
  3. Use TokenBatcher (instead of Batcher) with your vocabulary file, and pass use_character_inputs=False and the name of the output file from step 2 to the BidirectionalLanguageModel constructor.

See usage_token.py for a detailed usage example.
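Step 1 of the list above (collecting the unique tokens and adding <S> and </S>) might look like the sketch below. It uses only the standard library; the function name unique_token_vocab is hypothetical, and keeping tokens in first-seen order is a choice made here for readability, not something the text requires.

```python
from typing import Iterable, List


def unique_token_vocab(sentences: Iterable[List[str]]) -> List[str]:
    """Collect unique tokens in first-seen order, after the special tokens."""
    vocab = ["<S>", "</S>"]
    seen = set(vocab)
    for sentence in sentences:
        for token in sentence:
            if token not in seen:
                seen.add(token)
                vocab.append(token)
    return vocab


dataset = [["Hello", "world", "."], ["Hello", "again", "."]]
vocab = unique_token_vocab(dataset)
print(vocab)  # ['<S>', '</S>', 'Hello', 'world', '.', 'again']

# One token per line, as the vocabulary file format expects.
vocab_file_contents = "\n".join(vocab) + "\n"
```

Writing vocab_file_contents to disk gives a file that can then be passed to the later steps.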

Dumping biLM embeddings for an entire dataset to a single file.

To take this option, create a text file with your tokenized dataset. Each line is one tokenized sentence (whitespace separated). Then use dump_bilm_embeddings.

The output file is hdf5 format. Each sentence in the input data is stored as a dataset with key str(sentence_id) where sentence_id is the line number in the dataset file (indexed from 0). The embeddings for each sentence are a shape (3, n_tokens, 1024) array.
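As a rough illustration of the input format and the key convention described above, the sketch below writes a tokenized dataset with one sentence per line and records which key each sentence would have in the output file. The helper write_dataset_file is hypothetical and does not call dump_bilm_embeddings itself.

```python
import os
import tempfile
from typing import Dict, List


def write_dataset_file(sentences: List[List[str]], path: str) -> Dict[str, List[str]]:
    """Write one whitespace-separated sentence per line; return key -> tokens."""
    keys = {}
    with open(path, "w", encoding="utf-8") as f:
        for sentence_id, tokens in enumerate(sentences):
            f.write(" ".join(tokens) + "\n")
            # Keys in the output file are str(line number), counted from 0.
            keys[str(sentence_id)] = tokens
    return keys


sentences = [["The", "first", "sentence", "."], ["A", "second", "one", "."]]
dataset_path = os.path.join(tempfile.mkdtemp(), "dataset.txt")
key_to_tokens = write_dataset_file(sentences, dataset_path)
print(key_to_tokens["1"])  # ['A', 'second', 'one', '.']
```

Under the conventions above, the dataset stored under key "1" would then be expected to have shape (3, 4, 1024), since that sentence has four tokens.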

See usage_cached.py for a detailed example.

Training a biLM on a new corpus

Broadly speaking, the process to train and use a new biLM is:

  1. Prepare input data and a vocabulary file.
  2. Train the biLM.
  3. Test (compute the perplexity of) the biLM on heldout data.
  4. Write out the weights from the trained biLM to a hdf5 file.
  5. See the instructions above for using the output from Step #4 in downstream models.

1. Prepare input data and a vocabulary file.

To train and evaluate a biLM, you need to provide:

  • a vocabulary file
  • a set of training files
  • a set of heldout files

The vocabulary file is a text file with one token per line. It must also include the special tokens <S>, </S> and <UNK> (case sensitive).

IMPORTANT: the vocabulary file should be sorted in descending order by token count in your training data. The first three lines should be the special tokens (<S>, </S> and <UNK>), then the most common token in the training data, ending with the least common token.
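The ordering requirement above can be met with a token counter. This is a minimal sketch using only the standard library; build_training_vocab is a hypothetical helper name, and the tiny corpus exists only to show the ordering.

```python
from collections import Counter
from typing import Iterable, List

SPECIAL_TOKENS = ["<S>", "</S>", "<UNK>"]


def build_training_vocab(sentences: Iterable[List[str]]) -> List[str]:
    """Special tokens first, then tokens by descending training count."""
    counts = Counter(token for sentence in sentences for token in sentence)
    # most_common() returns (token, count) pairs sorted by descending count.
    ordered = [token for token, _ in counts.most_common()
               if token not in SPECIAL_TOKENS]
    return SPECIAL_TOKENS + ordered


corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "end"]]
vocab = build_training_vocab(corpus)
print(vocab[:5])  # ['<S>', '</S>', '<UNK>', 'the', 'sat']
```

Joining vocab with newlines and writing it to a file yields a vocabulary file in the required order.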

NOTE: the vocabulary file used in training may differ from the one used for prediction.

The training data should be randomly split into many training files, each containing one slice of the data. Each file contains pre-tokenized and white space separated text, one sentence per line. Don't include the <S> or </S> tokens in your training data.

All tokenization/normalization is done before training a model, so both the vocabulary file and training files should include normalized tokens. As the default settings use a fully character based token representation, in general we do not recommend any normalization other than tokenization.

Finally, reserve a small amount of the training data as heldout data for evaluating the trained biLM.
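One possible way to produce the random split and the heldout set is sketched below. The function split_corpus, the round-robin sharding, and the choice of shard count and heldout fraction are all assumptions made for illustration; the text above only requires a random split into many files plus a small heldout portion.

```python
import random
from typing import List, Tuple


def split_corpus(lines: List[str], n_shards: int, heldout_fraction: float,
                 seed: int = 0) -> Tuple[List[List[str]], List[str]]:
    """Shuffle sentences, reserve a heldout slice, shard the remainder."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    n_heldout = max(1, int(len(shuffled) * heldout_fraction))
    heldout, train = shuffled[:n_heldout], shuffled[n_heldout:]
    # Round-robin assignment gives shards of nearly equal size.
    shards = [train[i::n_shards] for i in range(n_shards)]
    return shards, heldout


lines = ["sentence number %d ." % i for i in range(100)]
shards, heldout = split_corpus(lines, n_shards=4, heldout_fraction=0.05)
print(len(shards), sum(len(s) for s in shards), len(heldout))
```

Each shard would then be written to its own file, one sentence per line, with the heldout sentences written to separate files.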

2. Train the biLM.

The hyperparameters used to train the ELMo model can be found in bin/train_elmo.py.

The ELMo model was trained on 3 GPUs. To train a new model with the same hyperparameters, first download the training data from the 1 Billion Word Benchmark. Then download the vocabulary file. Finally, run:

export CUDA_VISIBLE_DEVICES=0,1,2
python bin/train_elmo.py \
    --train_prefix='/path/to/1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/*' \
    --vocab_file /path/to/vocab-2016-09-10.txt \
    --save_dir /output_path/to/checkpoint

3. Evaluate the trained model.

Use bin/run_test.py to evaluate a trained model, e.g.

export CUDA_VISIBLE_DEVICES=0
python bin/run_test.py \
    --test_prefix='/path/to/1-billion-word-language-modeling-benchmark-r13output/heldout-monolingual.tokenized.shuffled/news.en.heldout-000*' \
    --vocab_file /path/to/vocab-2016-09-10.txt \
    --save_dir /output_path/to/checkpoint

4. Convert the tensorflow checkpoint to hdf5 for prediction with bilm or allennlp.

First, create an options.json file for the newly trained model. To do so, follow the template in an existing file (e.g. the original options.json) and modify it for your hyperparameters.

Important: always set n_characters to 262 after training (see below).

Then run:

python bin/dump_weights.py \
    --save_dir /output_path/to/checkpoint \
    --outfile /output_path/to/weights.hdf5

Frequently asked questions and other warnings

Can you provide the tensorflow checkpoint from training?

The tensorflow checkpoint is available by downloading these files:

How do I fine tune a model on additional unlabeled data?

First download the checkpoint files above. Then prepare the dataset as described in the section "Training a biLM on a new corpus", with the exception that we will use the existing vocabulary file instead of creating a new one. Finally, use the script bin/restart.py to restart training with the existing checkpoint on the new dataset. For small datasets (e.g. < 10 million tokens) we only recommend tuning for a small number of epochs and monitoring the perplexity on a heldout set, otherwise the model will overfit the small dataset.

Are the softmax weights available?

They are available in the training checkpoint above.

Can you provide some more details about how the model was trained?

The script bin/train_elmo.py has hyperparameters for training the model. The original model was trained on 3 GTX 1080 for 10 epochs, taking about two weeks.

For input processing, we used the raw 1 Billion Word Benchmark dataset here, and the existing vocabulary of 793471 tokens, including <S>, </S> and <UNK>. You can find our vocabulary file here. At the model input, all text used the full character based representation, including tokens outside the vocab. For the softmax output we replaced OOV tokens with <UNK>.

The model was trained with a fixed size window of 20 tokens. The batches were constructed by padding sentences with <S> and </S>, then packing tokens from one or more sentences into each row to completely fill each batch. Partial sentences and the LSTM states were carried over from batch to batch so that the language model could use information across batches for context, but backpropagation was broken at each batch boundary.
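The packing scheme described above can be illustrated on a single row. This is a simplified sketch, not the training code: pack_windows is a hypothetical helper, it handles one row rather than a full batch, and it drops the final partial window instead of carrying it over.

```python
from typing import Iterable, Iterator, List


def pack_windows(sentences: Iterable[List[str]],
                 window: int = 20) -> Iterator[List[str]]:
    """Wrap sentences in <S>/</S> and cut the token stream into windows."""
    buffer: List[str] = []
    for tokens in sentences:
        buffer.extend(["<S>"] + tokens + ["</S>"])
        # Emit full windows; a sentence may straddle two windows.
        while len(buffer) >= window:
            yield buffer[:window]
            buffer = buffer[window:]
    # Leftover tokens (a partial window) are dropped in this sketch.


sentences = [["w%d" % i for i in range(12)] for _ in range(3)]
windows = list(pack_windows(sentences))
print([len(w) for w in windows])  # [20, 20]
print(windows[0][13:15])          # ['</S>', '<S>']
```

The second printed line shows a sentence boundary falling inside a window, which is why no padding token is needed during training.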

Why do I get slightly different embeddings if I run the same text through the pre-trained model twice?

As a result of the training method (see above), the LSTMs are stateful and carry their state forward from batch to batch. This introduces a small amount of non-determinism, especially for the first two batches.

Why does training seem to take forever even with my small dataset?

The number of gradient updates during training is determined by:

  • the number of tokens in the training data (n_train_tokens)
  • the batch size (batch_size)
  • the number of epochs (n_epochs)

Be sure to set these values for your particular dataset in bin/train_elmo.py.
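To see why these values matter, the number of updates can be estimated ahead of time. The sketch below assumes each update consumes batch_size * window * n_gpus tokens, with the 20 token window mentioned earlier; that expression is an assumption for illustration, so check the training code for the exact calculation. The function name and the example numbers are hypothetical.

```python
def estimate_gradient_updates(n_train_tokens: int, batch_size: int,
                              n_epochs: int, window: int = 20,
                              n_gpus: int = 1) -> int:
    """Estimate total updates, assuming full windows on every GPU."""
    # Assumption: tokens consumed by a single gradient update.
    tokens_per_update = batch_size * window * n_gpus
    updates_per_epoch = n_train_tokens // tokens_per_update
    return n_epochs * updates_per_epoch


# Illustrative numbers only: a 10 million token corpus on one GPU.
total_updates = estimate_gradient_updates(
    n_train_tokens=10_000_000, batch_size=128, n_epochs=10)
print(total_updates)  # 39060
```

If n_train_tokens is left at a value meant for a much larger corpus, the estimate (and the training run) grows proportionally, which matches the symptom in the question.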

What's the deal with n_characters and padding?

During training, we fill each batch to exactly 20 tokens by adding <S> and </S> to each sentence, then packing tokens from one or more sentences into each row to completely fill each batch. As a result, we do not allocate space for a special padding token. The UnicodeCharsVocabulary that converts token strings to lists of character ids always uses a fixed number of character embeddings of n_characters=261, so always set n_characters=261 during training.

However, for prediction, we ensure each sentence is fully contained in a single batch, and as a result pad sentences of different lengths with a special padding id. This occurs in the Batcher (see here). As a result, set n_characters=262 during prediction in the options.json.
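Switching between the two values amounts to a one-field change in the options. The sketch below works on an in-memory dictionary; it assumes n_characters is stored under a "char_cnn" key, which should be checked against your own options.json, and options_for_prediction is a hypothetical helper.

```python
import json


def options_for_prediction(options: dict) -> dict:
    """Return a copy of the options with n_characters set for prediction."""
    # Deep copy via a JSON round trip so the training options stay untouched.
    updated = json.loads(json.dumps(options))
    # Assumed layout: n_characters lives under the "char_cnn" section.
    updated["char_cnn"]["n_characters"] = 262
    return updated


training_options = {"char_cnn": {"n_characters": 261,
                                 "max_characters_per_token": 50}}
prediction_options = options_for_prediction(training_options)
print(prediction_options["char_cnn"]["n_characters"])  # 262
print(training_options["char_cnn"]["n_characters"])    # 261
```

The result can be serialized with json.dump and saved as the options.json used for prediction.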

How can I use ELMo to compute sentence representations?

Simple methods like average and max pooling of the word level ELMo representations across sentences work well, often outperforming supervised methods on benchmark datasets. See "Evaluation of sentence embeddings in downstream and linguistic probing tasks", Perone et al, 2018 arxiv link.
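Both pooling methods reduce a list of per-token vectors to a single sentence vector. The following is a minimal sketch in plain Python with toy 3-dimensional vectors standing in for real ELMo representations; mean_pool and max_pool are hypothetical helper names.

```python
from typing import List, Sequence


def mean_pool(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average each dimension over all tokens in the sentence."""
    n_tokens = len(vectors)
    return [sum(column) / n_tokens for column in zip(*vectors)]


def max_pool(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Take the maximum of each dimension over all tokens."""
    return [max(column) for column in zip(*vectors)]


# Two tokens, three dimensions each (real ELMo vectors are much larger).
token_vectors = [[1.0, 4.0, -2.0], [3.0, 0.0, 2.0]]
print(mean_pool(token_vectors))  # [2.0, 2.0, 0.0]
print(max_pool(token_vectors))   # [3.0, 4.0, 2.0]
```

With numpy arrays the same reductions are a mean or max over the token axis.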

I'm seeing a WARNING when serializing models, is it a problem?

The below warning can be safely ignored:

2018-08-24 13:04:08,779 : WARNING : Error encountered when serializing lstm_output_embeddings.
Type is unsupported, or the types of the items don't match field type in CollectionDef.
'list' object has no attribute 'name'
