
TextAugment: Text Augmentation Library

TextAugment: Improving Short Text Classification through Global Augmentation Methods


You have just found TextAugment.

TextAugment is a Python 3 library for augmenting text for natural language processing applications. TextAugment stands on the giant shoulders of NLTK, Gensim, and TextBlob and plays nicely with them.


Features

  • Generate synthetic data to improve model performance without manual effort
  • Simple, lightweight, easy-to-use library
  • Plugs into any machine learning framework (e.g. PyTorch, TensorFlow, scikit-learn)
  • Supports textual data

Citation Paper

Improving short text classification through global augmentation methods.


Requirements

  • Python 3

The following software packages are dependencies and will be installed automatically.

$ pip install numpy nltk gensim textblob googletrans 

The following code downloads the NLTK corpus for WordNet.

import nltk
nltk.download('wordnet')

The following code downloads the NLTK Punkt tokenizer. This tokenizer divides a text into a list of sentences, using an unsupervised algorithm to build a model for abbreviations, collocations, and words that start sentences.

nltk.download('punkt')

The following code downloads the default NLTK part-of-speech tagger model. A part-of-speech tagger processes a sequence of words and attaches a part-of-speech tag to each word.

nltk.download('averaged_perceptron_tagger')

Use Gensim to load a pre-trained word2vec model, such as the Google News vectors.

import gensim
# In Gensim 4+, word2vec-format vectors are loaded via KeyedVectors.
model = gensim.models.KeyedVectors.load_word2vec_format('./GoogleNews-vectors-negative300.bin', binary=True)

You can also use Gensim to load Facebook's FastText English and multilingual models.

import gensim
model = gensim.models.fasttext.load_facebook_model('./cc.en.300.bin.gz')

Or train a model from scratch using your own data or a public dataset, as sketched below.
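As a minimal sketch, assuming Gensim 4's API and a toy in-memory corpus, training a word2vec model from scratch looks like this:

import gensim

# Toy corpus: in practice, use your own tokenized sentences or a public dataset.
sentences = [
    ['the', 'stories', 'are', 'good'],
    ['the', 'films', 'are', 'good'],
    ['john', 'is', 'going', 'to', 'town'],
]

# Train a small word2vec model (Gensim 4+ uses vector_size; Gensim 3 called it size).
model = gensim.models.Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)

# Save it so the path (or the model object itself) can be passed to textaugment.
model.save('my_word2vec.model')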

Installation

Install from pip [Recommended]

$ pip install textaugment
or install the latest release from GitHub
$ pip install git+https://github.com/dsfsi/textaugment.git

Install from source

$ git clone git@github.com:dsfsi/textaugment.git
$ cd textaugment
$ python setup.py install

How to use

There are three types of augmentations which can be used:

  • word2vec
    from textaugment import Word2vec
  • wordnet
    from textaugment import Wordnet
  • translate (this requires internet access)
    from textaugment import Translate

Word2vec-based augmentation

See this notebook for an example

Basic example

>>> from textaugment import Word2vec
>>> t = Word2vec(model='path/to/gensim/model')  # or pass a loaded Gensim model object directly
>>> t.augment('The stories are good')
The films are good

Advanced example

>>> runs = 1  # number of augmentation passes; 1 by default
>>> v = False  # verbose mode replaces all the words; when enabled, runs has no effect. Used in this paper: https://www.cs.cmu.edu/~diyiy/docs/emnlp_wang_2015.pdf
>>> p = 0.5  # probability of success of an individual trial (0.1 < p < 1.0); 0.5 by default. Used by the geometric distribution to select words from a sentence.

>>> t = Word2vec(model='path/to/gensim/model', runs=5, v=False, p=0.5)  # or pass a loaded Gensim model object
>>> t.augment('The stories are good')
The movies are excellent
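To make the role of p concrete, here is a minimal sketch, not textaugment's internal code, of how a geometric distribution can decide how many and which words to replace; numpy and the helper name are assumptions for illustration.

import numpy as np

def sample_words_to_replace(sentence, p=0.5, rng=None):
    """Pick word positions to replace using a geometric distribution.

    Higher p makes small counts more likely, so fewer words are replaced
    on average; this mirrors how the p parameter is described above.
    """
    if rng is None:
        rng = np.random.default_rng()
    words = sentence.split()
    # Geometric draw (always >= 1), capped at the sentence length.
    n_replace = min(int(rng.geometric(p)), len(words))
    # Choose that many distinct positions uniformly at random.
    return sorted(rng.choice(len(words), size=n_replace, replace=False))

print(sample_words_to_replace('The stories are good', p=0.5))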

WordNet-based augmentation

Basic example

>>> import nltk
>>> nltk.download('punkt')
>>> nltk.download('wordnet')
>>> from textaugment import Wordnet
>>> t = Wordnet()
>>> t.augment('In the afternoon, John is going to town')
In the afternoon, John is walking to town

Advanced example

>>> v = True  # enable verb augmentation; True by default
>>> n = False  # enable noun augmentation; False by default
>>> runs = 1  # number of times to augment a sentence; 1 by default
>>> p = 0.5  # probability of success of an individual trial (0.1 < p < 1.0); 0.5 by default. Used by the geometric distribution to select words from a sentence.

>>> t = Wordnet(v=False, n=True, p=0.5)
>>> t.augment('In the afternoon, John is going to town')
In the afternoon, Joseph is going to town.

RTT-based augmentation

RTT (round-trip translation) translates a sentence into a target language and then back into the source language, producing a paraphrase of the original.

Example

>>> src = "en" # source language of the sentence
>>> to = "fr" # target language
>>> from textaugment import Translate
>>> t = Translate(src="en", to="fr")
>>> t.augment('In the afternoon, John is going to town')
In the afternoon John goes to town
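Under the hood this is round-trip translation. A minimal sketch of the idea using googletrans directly (the exact API can differ between googletrans releases, so treat this as illustrative):

from googletrans import Translator

def round_trip(sentence, src='en', pivot='fr'):
    """Paraphrase a sentence by translating to a pivot language and back."""
    translator = Translator()
    forward = translator.translate(sentence, src=src, dest=pivot).text
    return translator.translate(forward, src=pivot, dest=src).text

print(round_trip('In the afternoon, John is going to town'))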

EDA: Easy data augmentation techniques for boosting performance on text classification tasks

This is an implementation of EDA by Jason Wei and Kai Zou.

https://www.aclweb.org/anthology/D19-1670.pdf

See this notebook for an example

Synonym Replacement

Randomly choose n words from the sentence that are not stop words. Replace each of these words with one of its synonyms chosen at random.

Basic example

>>> from textaugment import EDA
>>> t = EDA()
>>> t.synonym_replacement("John is going to town")
John is give out to town

Random Deletion

Randomly remove each word in the sentence with probability p.

Basic example

>>> from textaugment import EDA
>>> t = EDA()
>>> t.random_deletion("John is going to town", p=0.2)
is going to town

Random Swap

Randomly choose two words in the sentence and swap their positions. Do this n times.

Basic example

>>> from textaugment import EDA
>>> t = EDA()
>>> t.random_swap("John is going to town")
John town going to is

Random Insertion

Find a random synonym of a random word in the sentence that is not a stop word. Insert that synonym into a random position in the sentence. Do this n times.

Basic example

>>> from textaugment import EDA
>>> t = EDA()
>>> t.random_insertion("John is going to town")
John is going to make up town

Mixup augmentation

This is an implementation of mixup augmentation by Hongyi Zhang, Moustapha Cisse, Yann Dauphin, and David Lopez-Paz, adapted to NLP.

Used in Augmenting Data with Mixup for Sentence Classification: An Empirical Study.

Mixup is a generic and straightforward data augmentation principle. In essence, mixup trains a neural network on convex combinations of pairs of examples and their labels. By doing so, mixup regularises the neural network to favour simple linear behaviour in-between training examples.

Implementation

See this notebook for an example
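Pending the notebook, here is a minimal sketch of the idea for sentence classification, not the notebook's exact code: mixup applied to fixed-length sentence embeddings, with numpy and the array shapes as assumptions for illustration.

import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """Convex combination of two examples and their one-hot labels.

    lam is drawn from Beta(alpha, alpha), as in Zhang et al. (2018).
    """
    if rng is None:
        rng = np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    x = lam * x1 + (1.0 - lam) * x2
    y = lam * y1 + (1.0 - lam) * y2
    return x, y

# Two toy 300-dimensional sentence embeddings with one-hot labels.
x1, y1 = np.random.rand(300), np.array([1.0, 0.0])
x2, y2 = np.random.rand(300), np.array([0.0, 1.0])
x_mix, y_mix = mixup(x1, y1, x2, y2)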

Built with ❤

Authors

Acknowledgements

Please cite this paper when using this library. An arXiv version is also available.

@inproceedings{marivate2020improving,
  title={Improving short text classification through global augmentation methods},
  author={Marivate, Vukosi and Sefara, Tshephisho},
  booktitle={International Cross-Domain Conference for Machine Learning and Knowledge Extraction},
  pages={385--399},
  year={2020},
  organization={Springer}
}

Licence

MIT licensed. See the bundled LICENCE file for more details.
