  • Stars: 234
  • Rank: 171,630 (Top 4 %)
  • Language: Jupyter Notebook
  • License: MIT License
  • Created almost 7 years ago
  • Updated over 2 years ago

Repository Details

scikit-learn wrappers for Python fastText.

skift

>>> import pandas
>>> from skift import FirstColFtClassifier
>>> df = pandas.DataFrame([['woof', 0], ['meow', 1]], columns=['txt', 'lbl'])
>>> sk_clf = FirstColFtClassifier(lr=0.3, epoch=10)
>>> sk_clf.fit(df[['txt']], df['lbl'])
>>> sk_clf.predict([['woof']])
[0]

1   Installation

Dependencies:

  • numpy
  • scipy
  • scikit-learn
  • The fasttext Python package

pip install skift

2   Configuration

Because fasttext reads input data from files, skift has to dump the input data into temporary files for fasttext to use. A dedicated folder is created for those files on the filesystem. By default, this folder is placed in the system temporary storage location (e.g. /tmp on *nix systems). To override this default location, use the SKIFT_TEMP_DIR environment variable:

export SKIFT_TEMP_DIR=/path/to/desired/temp/folder

NOTE: The directory will be created if it does not already exist.
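
The same override can be applied from Python, provided the variable is set before skift needs its temporary folder. The sketch below is a stand-in, not skift's own code: resolve_temp_dir and dump_for_fasttext are hypothetical helpers, and the "skift" subfolder name used for the fallback is an assumption. The only parts taken from the text above are the SKIFT_TEMP_DIR variable, the fallback to the system temporary location, and the create-if-missing behavior. The __label__ prefix is fastText's default label marker in supervised training files.

```python
import os
import tempfile


def resolve_temp_dir():
    """Return the folder for temporary training files, creating it if needed.

    Hypothetical helper: it mirrors the behavior described above, not
    skift's actual implementation.
    """
    base = os.environ.get("SKIFT_TEMP_DIR")
    if base is None:
        # Assumed fallback layout: a "skift" folder under the system temp dir.
        base = os.path.join(tempfile.gettempdir(), "skift")
    os.makedirs(base, exist_ok=True)
    return base


def dump_for_fasttext(texts, labels, path):
    """Write one '__label__<label> <text>' line per sample."""
    with open(path, "w", encoding="utf-8") as f:
        for text, label in zip(texts, labels):
            f.write("__label__{} {}\n".format(label, text))


# Point the override at a folder that does not exist yet.
os.environ["SKIFT_TEMP_DIR"] = os.path.join(tempfile.mkdtemp(), "skift_tmp")
temp_dir = resolve_temp_dir()
train_path = os.path.join(temp_dir, "train.txt")
dump_for_fasttext(["woof", "meow"], [0, 1], train_path)
```

Setting the variable through os.environ only affects the current process and its children, so it is a convenient option for notebooks and test suites where exporting a shell variable is awkward.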

3   Features

4   Wrappers

fastText works only on text data, which means that it will only use a single column from a dataset which might contain many feature columns of different types. As such, a common use case is to have the fastText classifier use a single column as input, ignoring other columns. This is especially true when fastText is to be used as one of several classifiers in a stacking classifier, with other classifiers using non-textual features.

skift includes several scikit-learn-compatible wrappers (for the official fastText Python package) which cater to these use cases.

NOTICE: Any additional keyword arguments provided to the classifier constructor, besides those required, will be forwarded to the fastText.train_supervised method on every call to fit.
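
The forwarding pattern described in the notice can be illustrated with a toy class. This is a sketch under stated assumptions, not skift's source: KwargsForwardingClassifier and fake_train_supervised are hypothetical names, and the fake trainer only records its arguments so that the example runs without fastText installed.

```python
class KwargsForwardingClassifier:
    """Toy stand-in for a skift wrapper; not the library's actual class."""

    def __init__(self, train_fn, **kwargs):
        self.train_fn = train_fn  # stands in for fastText's training entry point
        self.kwargs = kwargs      # e.g. lr=0.3, epoch=10

    def fit(self, train_path):
        # The stored keyword arguments are passed along on every fit call.
        self.model = self.train_fn(train_path, **self.kwargs)
        return self


def fake_train_supervised(input, **kwargs):
    """Hypothetical trainer that just records what it was called with."""
    return {"input": input, "params": dict(kwargs)}


clf = KwargsForwardingClassifier(fake_train_supervised, lr=0.3, epoch=10)
clf.fit("train.txt")
print(clf.model["params"])  # prints {'lr': 0.3, 'epoch': 10}
```

In a design like this, a misspelled hyperparameter name is not rejected by the constructor; it only surfaces when fit hands it to the training function.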

4.1   Standard wrappers

These wrappers make no assumptions about the input beyond those commonly made by scikit-learn classifiers, i.e. that the input is a 2d ndarray object and such.

  • FirstColFtClassifier - An sklearn classifier adapter for fasttext that takes the first column of input ndarray objects as input.
>>> import pandas
>>> from skift import FirstColFtClassifier
>>> df = pandas.DataFrame([['woof', 0], ['meow', 1]], columns=['txt', 'lbl'])
>>> sk_clf = FirstColFtClassifier(lr=0.3, epoch=10)
>>> sk_clf.fit(df[['txt']], df['lbl'])
>>> sk_clf.predict([['woof']])
[0]
  • IdxBasedFtClassifier - An sklearn classifier adapter for fasttext that takes input by column index. This is set on object construction by providing the input_ix parameter to the constructor.
>>> import pandas
>>> from skift import IdxBasedFtClassifier
>>> df = pandas.DataFrame([[5, 'woof', 0], [83, 'meow', 1]], columns=['count', 'txt', 'lbl'])
>>> sk_clf = IdxBasedFtClassifier(input_ix=1, lr=0.4, epoch=6)
>>> sk_clf.fit(df[['count', 'txt']], df['lbl'])
>>> sk_clf.predict([[7, 'woof']])  # the text is read from column index 1
[0]
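
To make the column-selection idea concrete, here is a plain-Python sketch of what picking the text column by index amounts to. select_column is a hypothetical helper, not part of skift. It illustrates that an index-based selector only finds the text when every row has the expected column layout.

```python
def select_column(rows, input_ix=0):
    """Return the values found at position input_ix of every row."""
    return [row[input_ix] for row in rows]


X_train = [[5, "woof"], [83, "meow"]]

# Selecting the first column reads the counts here, while input_ix=1
# reads the text column.
first_col = select_column(X_train)             # [5, 83]
text_col = select_column(X_train, input_ix=1)  # ['woof', 'meow']

# A row with a single column has no index 1.
try:
    select_column([["woof"]], input_ix=1)
    single_col_ok = True
except IndexError:
    single_col_ok = False
```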

4.2   pandas-dependent wrappers

These wrappers assume the X parameter given to fit, predict, and predict_proba methods is a pandas.DataFrame object:

  • FirstObjFtClassifier - An sklearn adapter for fasttext using the first column of dtype == object as input.
>>> import pandas
>>> from skift import FirstObjFtClassifier
>>> df = pandas.DataFrame([['woof', 0], ['meow', 1]], columns=['txt', 'lbl'])
>>> sk_clf = FirstObjFtClassifier(lr=0.2)
>>> sk_clf.fit(df[['txt']], df['lbl'])
>>> sk_clf.predict(pandas.DataFrame([['woof']], columns=['txt']))
[0]
  • ColLblBasedFtClassifier - An sklearn adapter for fasttext taking input by column label. This is set on object construction by providing the input_col_lbl parameter to the constructor.
>>> import pandas
>>> from skift import ColLblBasedFtClassifier
>>> df = pandas.DataFrame([['woof', 0], ['meow', 1]], columns=['txt', 'lbl'])
>>> sk_clf = ColLblBasedFtClassifier(input_col_lbl='txt', epoch=8)
>>> sk_clf.fit(df[['txt']], df['lbl'])
>>> sk_clf.predict(pandas.DataFrame([['woof']], columns=['txt']))
[0]
  • SeriesFtClassifier - An sklearn adapter for fasttext taking a pandas Series as input.
>>> import pandas
>>> from skift import SeriesFtClassifier
>>> df = pandas.DataFrame([['woof', 0], ['meow', 1]], columns=['txt', 'lbl'])
>>> sk_clf = SeriesFtClassifier(epoch=8)
>>> sk_clf.fit(df['txt'], df['lbl'])
>>> sk_clf.predict(pandas.Series(['woof']))
>>> sk_clf.predict(df['txt'])

4.3   Hyperparameter auto-tuning

A validation set can be passed to fit() in order to auto-tune the hyperparameters.

First, to adjust the auto-tune settings, the corresponding keyword arguments can be passed to the constructor (if none are passed the default settings are used):

>>> import pandas
>>> from skift import SeriesFtClassifier
>>> df_train = pandas.DataFrame([['woof', 0], ['meow', 1]], columns=['txt', 'lbl'])
>>> df_val = pandas.DataFrame([['woof woof', 0], ['meow meow', 1]], columns=['txt', 'lbl'])
>>> sk_clf = SeriesFtClassifier(epoch=8, autotuneDuration=5)

Then, the validation dataframe (or series, in this case, since we constructed a SeriesFtClassifier) and label column should be provided to the fit() method:

>>> sk_clf.fit(df_train['txt'], df_train['lbl'], X_validation=df_val['txt'], y_validation=df_val['lbl'])

Or simply by position:

>>> sk_clf.fit(df_train['txt'], df_train['lbl'], df_val['txt'], df_val['lbl'])
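
For orientation, the fastText Python package exposes auto-tuning through train_supervised keyword arguments such as autotuneValidationFile and autotuneDuration. The sketch below shows one way a wrapper could combine constructor settings with a validation file written at fit time. build_train_kwargs is a hypothetical helper and the merging logic is an assumption, not skift's actual code.

```python
def build_train_kwargs(train_path, constructor_kwargs, validation_path=None):
    """Assemble keyword arguments for a fastText-style training call."""
    kwargs = dict(constructor_kwargs)  # copy, so the stored settings stay intact
    kwargs["input"] = train_path
    if validation_path is not None:
        # The validation file is only added when validation data was given.
        kwargs["autotuneValidationFile"] = validation_path
    return kwargs


settings = {"epoch": 8, "autotuneDuration": 5}
with_val = build_train_kwargs("train.txt", settings, "valid.txt")
without_val = build_train_kwargs("train.txt", settings)
```

Copying the constructor settings before adding per-call entries keeps repeated fit calls independent of each other, which matters when the same classifier is refit inside cross-validation.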

4.4   Using Pre-trained word vectors

Pre-trained word vectors are used in exactly the same way as with the fastText Python module or the fastText CLI. However, the vector dimensions set in the constructor must be identical to the dimensions of the pre-trained vectors you are using; otherwise fastText will crash without explanation. For this reason, we provide an example:

from skift import SeriesFtClassifier
ft_clf = SeriesFtClassifier(
    autotuneDuration=900,
    pretrainedVectors='/Users/myuser/data/word_vectors/crawl-300d-2M.vec',
    dim=300,
)

In this case, not providing the constructor with dim=300 would bring about a crash when calling ft_clf.fit().
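
One way to avoid a mismatch is to read the dimension from the vectors file instead of hard-coding it. Text-format .vec files start with a header line of the form "<vocab_size> <dim>", so the value can be parsed directly. read_vec_dim below is a hypothetical helper, not part of skift or fastText, and the tiny file it reads is made up for the demonstration.

```python
import os
import tempfile


def read_vec_dim(path):
    """Return the vector dimension declared in a .vec file's header line."""
    with open(path, encoding="utf-8") as f:
        header = f.readline().split()
    if len(header) != 2:
        raise ValueError("expected a '<vocab_size> <dim>' header line")
    return int(header[1])


# Tiny stand-in file: 2 words, 3 dimensions each.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.vec")
with open(demo_path, "w", encoding="utf-8") as f:
    f.write("2 3\nwoof 0.1 0.2 0.3\nmeow 0.4 0.5 0.6\n")

dim = read_vec_dim(demo_path)  # 3
```

The returned value could then be passed as dim next to pretrainedVectors, for example SeriesFtClassifier(pretrainedVectors=path, dim=read_vec_dim(path)).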

5   Contributing

The package author and current maintainer is Shay Palachy ([email protected]); you are more than welcome to approach him for help. Contributions are very welcome.

5.1   Installing for development

Clone:

git clone git@github.com:shaypal5/skift.git

Install in development mode, including test dependencies:

cd skift
pip install -e '.[test]'

To also install fasttext, see instructions in the Installation section.

5.2   Running the tests

To run the tests use:

cd skift
pytest

5.3   Adding documentation

The project is documented using the numpy docstring conventions, which were chosen as they are perhaps the most widespread conventions that are both supported by common tools such as Sphinx and result in human-readable docstrings. When documenting code you add to this project, follow these conventions.

Additionally, if you update this README.rst file, use python setup.py checkdocs to validate that it compiles.

6   Credits

Created by Shay Palachy ([email protected]).

Contributions:

Fixes: uniaz, crouffer, amirzamli and sgt.
