  • Stars: 700
  • Rank: 64,214 (Top 2%)
  • Language: Python
  • License: Mozilla Public Li...
  • Created over 6 years ago
  • Updated about 1 month ago

Repository Details

Scikit-learn style model finetuning for NLP

Finetune is a library that allows users to leverage state-of-the-art pretrained NLP models for a wide variety of downstream tasks.

Finetune currently supports TensorFlow implementations of the following models:

  1. BERT, from "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding"
  2. RoBERTa, from "RoBERTa: A Robustly Optimized BERT Pretraining Approach"
  3. GPT, from "Improving Language Understanding by Generative Pre-Training"
  4. GPT2, from "Language Models are Unsupervised Multitask Learners"
  5. TextCNN, from "Convolutional Neural Networks for Sentence Classification"
  6. Temporal Convolution Network, from "An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling"
  7. DistilBERT, from "Smaller, faster, cheaper, lighter: Introducing DistilBERT, a distilled version of BERT"

Section                Description
API Tour               Base models, configurables, and more
Installation           How to install using pip or directly from source
Finetune with Docker   Finetune and inference within a Docker container
Documentation          Full API documentation

Finetune API Tour

Finetuning the base language model is as easy as calling Classifier.fit:

from finetune import Classifier

model = Classifier()               # Load base model
model.fit(trainX, trainY)          # Finetune base model on custom data
model.save(path)                   # Serialize the model to disk
...
model = Classifier.load(path)      # Reload models from disk at any time
predictions = model.predict(testX) # [{'class_1': 0.23, 'class_2': 0.54, ..}, ..]
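
The comment on the last line suggests that each prediction is a dictionary mapping class names to scores. As a minimal sketch, assuming that shape holds, the most likely class for each input could be picked as follows. The helper name `top_classes` is hypothetical and not part of finetune, and the example scores are made up.

```python
from typing import Dict, List


def top_classes(predictions: List[Dict[str, float]]) -> List[str]:
    """Return the highest-scoring class name for each prediction dict."""
    # max(..., key=scores.get) picks the key whose value is largest.
    return [max(scores, key=scores.get) for scores in predictions]


# Example data shaped like the comment above (values are made up).
example = [
    {"class_1": 0.23, "class_2": 0.54, "class_3": 0.23},
    {"class_1": 0.70, "class_2": 0.20, "class_3": 0.10},
]
print(top_classes(example))  # ['class_2', 'class_1']
```

Keeping this step outside the model makes it easy to swap in a different rule, such as a score threshold, without refitting.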

Choose your desired base model from finetune.base_models:

from finetune import Classifier
from finetune.base_models import BERT, RoBERTa, GPT, GPT2, TextCNN, TCN
model = Classifier(base_model=BERT)

Optimize your model with a variety of configurables. A detailed list of all config items can be found in the finetune docs.

model = Classifier(
    low_memory_mode=True,
    lr_schedule="warmup_linear",
    max_length=512,
    l2_reg=0.01,
    oversample=True,
    # ... further config items
)
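
Because the configurables are plain keyword arguments, one way to keep experiments tidy is to hold them in a dictionary and unpack it at construction time. The sketch below uses a stand-in class, `FakeClassifier`, instead of finetune's `Classifier` so that it runs without the library installed. `BASE_CONFIG` and `build_model` are hypothetical names, and the keys are only the ones shown above.

```python
from typing import Any, Dict

# Settings copied from the example above; extend with items from the finetune docs.
BASE_CONFIG: Dict[str, Any] = {
    "low_memory_mode": True,
    "lr_schedule": "warmup_linear",
    "max_length": 512,
    "l2_reg": 0.01,
    "oversample": True,
}


class FakeClassifier:
    """Stand-in for finetune.Classifier that only records its keyword arguments."""

    def __init__(self, **config: Any) -> None:
        self.config = config


def build_model(model_cls, **overrides: Any):
    """Instantiate model_cls with BASE_CONFIG, letting overrides take precedence."""
    config = {**BASE_CONFIG, **overrides}
    return model_cls(**config)


model = build_model(FakeClassifier, max_length=256)
print(model.config["max_length"])   # 256
print(model.config["lr_schedule"])  # warmup_linear
```

With the real library, `FakeClassifier` would presumably be replaced by `Classifier`; that substitution is untested here.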

The library supports finetuning for a number of tasks. A detailed description of all target models can be found in the finetune API reference.

from finetune import *
models = (Classifier, MultiLabelClassifier, MultiFieldClassifier, MultipleChoice, # Classify one or more inputs into one or more classes
          Regressor, OrdinalRegressor, MultifieldRegressor,                       # Regress on one or more inputs
          SequenceLabeler, Association,                                           # Extract tokens from a given class, or infer relationships between them
          Comparison, ComparisonRegressor, ComparisonOrdinalRegressor,            # Compare two documents for a given task
          LanguageModel, MultiTask,                                               # Further pretrain your base models
          DeploymentModel                                                         # Wrapper to optimize your serialized models for a production environment
          )

For example usage of each of these target types, see the finetune/datasets directory. To keep the examples simple and fast to run, they use smaller versions of the published datasets.

If you have large amounts of unlabeled training data and only a small amount of labeled training data, you can finetune in two steps for best performance.

model = Classifier()               # Load base model
model.fit(unlabeledX)              # Finetune base model on unlabeled training data
model.fit(trainX, trainY)          # Continue finetuning with a smaller amount of labeled data
predictions = model.predict(testX) # [{'class_1': 0.23, 'class_2': 0.54, ..}, ..]
model.save(path)                   # Serialize the model to disk
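
A minimal sketch of how the inputs for the two steps might be prepared, assuming the raw data is a list of (text, label) pairs in which unlabeled examples carry `None` as the label. That data layout and the helper name `split_by_label` are assumptions for illustration, not part of finetune.

```python
from typing import List, Optional, Tuple


def split_by_label(
    records: List[Tuple[str, Optional[str]]]
) -> Tuple[List[str], List[str], List[str]]:
    """Split (text, label) pairs into unlabeled texts, labeled texts, and labels."""
    unlabeled_x = [text for text, label in records if label is None]
    labeled = [(text, label) for text, label in records if label is not None]
    train_x = [text for text, _ in labeled]
    train_y = [label for _, label in labeled]
    return unlabeled_x, train_x, train_y


# Made-up records: two labeled, two unlabeled.
records = [
    ("great product", "positive"),
    ("arrived on tuesday", None),
    ("stopped working", "negative"),
    ("comes in a box", None),
]
unlabeledX, trainX, trainY = split_by_label(records)
print(len(unlabeledX), trainX, trainY)
# 2 ['great product', 'stopped working'] ['positive', 'negative']
```

The three returned lists line up with the `unlabeledX`, `trainX`, and `trainY` names used in the snippet above.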

Installation

Finetune can be installed directly from PyPI using pip:

pip3 install finetune

or installed directly from source:

git clone -b master https://github.com/IndicoDataSolutions/finetune && cd finetune
python3 setup.py develop              # symlinks the git directory to your python path
pip3 install tensorflow-gpu --upgrade # or tensorflow-cpu
python3 -m spacy download en          # download spacy tokenizer

To run finetune on your host, you need a working installation of tensorflow-gpu >= 1.14.0 and up-to-date NVIDIA drivers.
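
The version requirement can also be checked programmatically. The sketch below compares dotted version strings using only the standard library. `numeric_version` and `meets_minimum` are hypothetical helpers, and they look only at the leading numeric components, so pre-release suffixes such as `rc1` are ignored.

```python
import re
from typing import Tuple


def numeric_version(version: str) -> Tuple[int, ...]:
    """Extract leading numeric components, e.g. '1.15.0rc1' -> (1, 15, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"no numeric version found in {version!r}")
    parts = tuple(int(part) for part in match.group(0).split("."))
    # Pad to three components so that '1.14' compares equal to '1.14.0'.
    return parts + (0,) * (3 - len(parts))


def meets_minimum(installed: str, minimum: str = "1.14.0") -> bool:
    """Return True if the installed version is at least the minimum."""
    return numeric_version(installed) >= numeric_version(minimum)


print(meets_minimum("1.15.2"))    # True
print(meets_minimum("1.13.1"))    # False
print(meets_minimum("2.0.0rc1"))  # True
```

In practice the installed version string could be read from `tensorflow.__version__` after importing TensorFlow.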

You can optionally run the provided test suite to ensure installation completed successfully.

pip3 install pytest
pytest

Docker

If you prefer, you can also run finetune in a Docker container. The provided bash scripts assume a working installation of Docker and nvidia-docker.

git clone https://github.com/IndicoDataSolutions/finetune && cd finetune

# For usage with NVIDIA GPUs
./docker/build_gpu_docker.sh      # builds a docker image
./docker/start_gpu_docker.sh      # starts a docker container in the background, forwards $PWD to /finetune

docker exec -it finetune bash # starts a bash session in the docker container

For CPU-only usage:

./docker/build_cpu_docker.sh
./docker/start_cpu_docker.sh

Documentation

Full documentation and an API reference for finetune are available at finetune.indico.io.

More Repositories

  1. Passage (Python, 531 stars): A little library for text analysis with RNNs.
  2. Enso (Python, 95 stars): Enso: An Open Source Library for Benchmarking Embeddings + Transfer Learning Methods
  3. IndicoIo-node (JavaScript, 62 stars): A Node.js wrapper for the Indico API
  4. SuperCell (Jupyter Notebook, 40 stars): Public tutorials and code that accompanies articles
  5. Foxhound (Python, 38 stars): Scikit learn inspired library for gpu-accelerated machine learning
  6. ImageSimilarity (JavaScript, 29 stars): Demo using image_features api to sort images based on similarity.
  7. plotlines (28 stars): Exploring the shapes of stories using indico sentiment analysis APIs
  8. IndicoIo-ruby (Ruby, 13 stars): A simple Ruby Wrapper for the indico set of APIs
  9. clothing_similarity (Python, 10 stars): Final and skeleton code for the clothing similarity walkthrough
  10. Indico-Solutions-Toolkit (Python, 9 stars): A library to assist in integrating the Indico IPA platform
  11. Doc2Dict (Python, 9 stars): Code accompanying Doc2Dict paper
  12. IndicoIo-PHP (PHP, 7 stars): A simple PHP Wrapper for the Indico API
  13. IndicoIo-R (R, 6 stars): A simple R Wrapper for the indico set of APIs
  14. ClusterRSS (JavaScript, 6 stars): A small app for clustering the content of RSS feeds
  15. ImageFeaturesClassifier (Python, 5 stars): Using indico's imagefeatures API and scikit-learn to produce a solve an image classification task
  16. spaCy (Python, 5 stars): Clone of spaCy for confidence levels
  17. indico-client-python (Python, 5 stars): Indico IPA client library
  18. virga (Python, 5 stars): Template-based adaptable sidecar app generation and plugins for deployment alongside Indico's IPA.
  19. indi-flask (Python, 4 stars): A template for building flask apps that use indico
  20. TwitterSentiment (JavaScript, 4 stars): A demo of indico's sentiment API
  21. IntercomBot (Python, 4 stars): Bot for triaging incoming intercom requests and assigning them to the right people.
  22. tf_cod (HCL, 3 stars): Terraform repository for Clusters on Demand (COD)
  23. SentimentDemo (Python, 3 stars): Tracking how sentiment changes throughout a novel.
  24. indico-client-java (Kotlin, 2 stars): Indico IPA java client
  25. content_recommendation (Python, 2 stars): A simple script to recommend content to a user based on things that they say.
  26. groundtruth (Python, 2 stars): Ground Truth Analysis Tooling
  27. KNNQuery (2 stars): Constant time nearest neighbors querying. Hopefully.
  28. indico-pretrained-uipath-demo (Visual Basic, 2 stars): Demo project showing off the use of indico's pretrained api activities for UIPath
  29. asyncio-chainable (Python, 2 stars)
  30. indico-blueprism-custom-actions (C#, 2 stars): Indico Custom Actions for Blue Prism
  31. Custom-Workflow-Template (Python, 1 star): template for custom workflow
  32. IndicoEditor (Python, 1 star): Use indico's APIs to explore trends and patterns in your writing.
  33. indicoio-mathematica (Mathematica, 1 star): Mathematica package for accessing predictive APIs from indico.io
  34. indico-tf-ops (C++, 1 star)
  35. RSSCustomization (Python, 1 star): RSS feed customization using the indico text tags API.
  36. LaunchAcademy (Ruby, 1 star): Source code for lessons at launch academy
  37. indico-uipath-custom-activities (C#, 1 star)
  38. raw_requests (Python, 1 star)
  39. indico-ui (CSS, 1 star): Indico UI Theme for Atom
  40. indico-client-csharp (C#, 1 star)
  41. xpdf_modified (C++, 1 star): Modified version of the xpdf library allowing json dumps
  42. IndicoIo-LoaderIo (Python, 1 star): Python script to generate load tests for indico clouds
  43. ContentRecommendation (Python, 1 star): Proof of concept for a content recommendation system using the indico text tags API.