bi-LSTM sequence tagger

Bidirectional Long-Short Term Memory sequence tagger (in DyNet), hierarchical, with word and character embeddings.

This is an extended version (structbilty) of the earlier bi-LSTM tagger by Plank et al. (2016).

If you use this tagger, please cite Plank et al. (2016); the full BibTeX entry is given in the References section below.

For the version called DsDs (Plank & Agić, 2018), please cite: https://aclanthology.coli.uni-saarland.de/papers/D18-1061/d18-1061

Installation

pip3 install --user -r requirements.txt

Example commands

Training the tagger:

python src/structbilty.py --dynet-mem 1500 --train data/da-ud-train.conllu --iters 10 --model da

Training with patience, i.e., early stopping on a dev set (training stops once the dev score has not improved for --patience evaluations):

python src/structbilty.py --dynet-mem 1500 --train data/da-ud-train.conllu --dev data/da-ud-dev.conllu --iters 50 --model da --patience 2

Testing and getting the output predictions:

python src/structbilty.py --model da --test data/da-ud-test.conllu --output predictions/test-da.out

Training and testing in two steps (--model is used both for saving and for loading):

mkdir -p predictions
python src/structbilty.py --dynet-mem 1500 --train data/da-ud-train.conllu --iters 10 --model da

python src/structbilty.py --model da --test data/da-ud-test.conllu --output predictions/test-da.out

By default, the model uses a softmax decoder; you can use a CRF for BIO sequence tagging instead with the --crf option. The tagger reports accuracy by default. If you use it for NER or similar span-based tasks, do not rely on accuracy; evaluate with span-F1 or a comparable span-level metric.
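To make the difference concrete, here is a minimal sketch of exact-match span-F1 over BIO tags (hypothetical helper functions for illustration, not part of bilty):

def bio_spans(tags):
    """Extract labeled spans (label, start, end) from a BIO tag sequence."""
    spans, start, label = [], None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-") or (tag.startswith("I-") and label != tag[2:]):
            if label is not None:
                spans.append((label, start, i))
            start, label = i, tag[2:]
        elif tag == "O" and label is not None:
            spans.append((label, start, i))
            start, label = None, None
    if label is not None:
        spans.append((label, start, len(tags)))
    return set(spans)

def span_f1(gold, pred):
    """Exact-match span F1 between gold and predicted BIO sequences."""
    g, p = bio_spans(gold), bio_spans(pred)
    tp = len(g & p)
    prec = tp / len(p) if p else 0.0
    rec = tp / len(g) if g else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

# The prediction misses one token of the entity, so exact-match F1 is 0.0
# even though token accuracy is 2/3:
print(span_f1(["B-PER", "I-PER", "O"], ["B-PER", "O", "O"]))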

Embeddings

The Polyglot embeddings (Al-Rfou et al., 2013) can be downloaded separately (0.6 GB).

You can load generic word embeddings with --embeds WORD_EMBEDS_FILE (such as the Polyglot embeddings above). Note that the embedding dimensionality must match the --in_dim option.

Bilty also supports loading additional embeddings from the input files. This can be enabled with --embeds_in_file FILE. It expects the train/dev/test files to be in the following format:

word1<tab>tag1<tab>emb=val1,val2,val3,...
word2<tab>tag2<tab>emb=val1,val2,val3,...
...

Note that the embedding dimensionality must match the --embeds_in_file_dim option.
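As an illustration, here is a small sketch that writes one sentence in this format (hypothetical helper, not part of bilty; the blank line as sentence separator is an assumption based on CoNLL-style files):

def write_embeds_in_file(tagged_words, vectors, out_path):
    """tagged_words: list of (word, tag) pairs; vectors: one list of floats per word."""
    assert len(tagged_words) == len(vectors)
    with open(out_path, "w", encoding="utf-8") as out:
        for (word, tag), vec in zip(tagged_words, vectors):
            values = ",".join(f"{v:.6f}" for v in vec)
            out.write(f"{word}\t{tag}\temb={values}\n")
        out.write("\n")  # blank line as sentence boundary (assumption)

write_embeds_in_file([("Hunden", "NOUN"), ("sover", "VERB")],
                     [[0.1, -0.2, 0.3], [0.0, 0.5, -0.1]],
                     "example.emb")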

We also provide scripts to generate these files for four commonly used embedding types (Polyglot, FastText, ELMo, and BERT); they can be found in the embeds folder. For example, to use BERT embeddings, run the following commands:

python3 embeds/transf.py bert-base-multilingual-cased data/da-ud-train.conllu
python3 embeds/transf.py bert-base-multilingual-cased data/da-ud-dev.conllu
python3 embeds/transf.py bert-base-multilingual-cased data/da-ud-test.conllu

This creates .bert files which can be used as input to Bilty when --embeds_in_file is enabled.
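At its core, such a script maps contextual subword vectors back to one vector per original word. A hedged sketch with the HuggingFace transformers library (mean-pooling over subwords is one common choice; the repo's transf.py may differ in details such as pooling and layer selection):

import torch
from transformers import AutoModel, AutoTokenizer

name = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)
model.eval()

words = ["Hunden", "sover"]  # one pre-tokenized sentence
enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
with torch.no_grad():
    hidden = model(**enc).last_hidden_state[0]  # one vector per subword

# Average subword vectors back into one vector per original word.
per_word = {}
for pos, wid in enumerate(enc.word_ids(0)):
    if wid is not None:
        per_word.setdefault(wid, []).append(hidden[pos])
vectors = [torch.stack(vs).mean(0).tolist() for _, vs in sorted(per_word.items())]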

Similar scripts for Polyglot are in the embeds folder. For now, the language is hardcoded in most of these scripts; please modify *.prep.py accordingly.

Please note that --embeds_in_file cannot be combined with the --raw option.

Options:

You can see the options by running:

python src/structbilty.py --help

A particularly useful option is DyNet autobatching (Neubig et al., 2017), which speeds up training considerably (by roughly 20%). You can activate it with:

python src/structbilty.py --dynet-autobatch 1
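Autobatching and the memory pool can also be configured programmatically before DyNet is first imported; a hedged sketch using the dynet_config module (parameter names follow the DyNet Python documentation and may vary across versions):

import dynet_config
# Mirrors the --dynet-autobatch 1 and --dynet-mem 1500 command-line flags;
# must run before the first "import dynet".
dynet_config.set(mem=1500, autobatch=True)
import dynet  # picks up the configuration above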

Major changes:

  • major refactoring of internal data handling
  • renaming to structbilty
  • --pred-layer is no longer required
  • a single --model option handles both saving and loading of model parameters
  • the option of running a CRF has been added
  • the tagger can handle additional lexical features (see our DsDs paper, EMNLP 2018, referenced below)
  • grouping of arguments
  • simplebilty is deprecated (still available in the former release)
  • it is best to run the tagger on a single CPU

References

# default reference
@inproceedings{plank-etal-2016,
    title = "Multilingual Part-of-Speech Tagging with Bidirectional Long Short-Term Memory Models and Auxiliary Loss",
    author = "Plank, Barbara  and
      S{\o}gaard, Anders  and
      Goldberg, Yoav",
    booktitle = "Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)",
    month = aug,
    year = "2016",
    address = "Berlin, Germany",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/P16-2067",
    doi = "10.18653/v1/P16-2067",
    pages = "412--418",
}

# for DsDs
@inproceedings{plank-agic-2018,
    title = "Distant Supervision from Disparate Sources for Low-Resource Part-of-Speech Tagging",
    author = "Plank, Barbara  and
      Agi{\'{c}}, {\v{Z}}eljko",
    booktitle = "Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing",
    year = "2018",
    address = "Brussels, Belgium",
    publisher = "Association for Computational Linguistics",
    url = "http://aclweb.org/anthology/D18-1061",
    pages = "614--620",
}


Installation from source (alternative)

You can also compile DyNet from source. Clone it into a directory of your choice, here called $DYNETDIR:

mkdir $DYNETDIR
git clone https://github.com/clab/dynet

Follow the instructions in the DyNet documentation (use -DPYTHON, see http://dynet.readthedocs.io/en/latest/python.html).

Then compile DyNet:

cmake .. -DEIGEN3_INCLUDE_DIR=$HOME/tools/eigen/ -DPYTHON=`which python`

(if you have a GPU, add the CUDA backend; note that this can result in non-deterministic behavior):

cmake .. -DEIGEN3_INCLUDE_DIR=$HOME/tools/eigen/ -DPYTHON=`which python` -DBACKEND=cuda

(You may need to set your PYTHONPATH to include DyNet's build/python directory.)

After successful installation, open Python and import dynet; you can test whether the installation worked with:

>>> import dynet
[dynet] random seed: 2809331847
[dynet] allocating memory: 512MB
[dynet] memory allocation done.
>>> dynet.__version__
2.0
