  • Stars: 548
  • Rank: 81,119 (Top 2 %)
  • Language: Python
  • License: MIT License
  • Created over 3 years ago
  • Updated 2 months ago

Repository Details

Active Learning for Text Classification in Python

Installation | Quick Start | Contribution | Changelog | Docs

Small-Text provides state-of-the-art Active Learning for Text Classification. Several pre-implemented Query Strategies, Initialization Strategies, and Stopping Criteria are provided, which can be easily mixed and matched to build active learning experiments or applications.

What is Active Learning?
Active Learning allows you to efficiently label training data in a small data scenario.
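
To make the idea concrete, here is a minimal sketch of a pool-based active learning loop. It does not use Small-Text's API: the one-dimensional data, the `oracle` function (a stand-in for the human annotator), and the toy threshold "classifier" are all assumptions made purely for illustration.

```python
def oracle(x):
    """Stand-in for the human annotator (hypothetical): returns the true label."""
    return int(x >= 0.6)


def fit_threshold(labeled):
    """Toy 'classifier': the midpoint between the largest point labeled 0
    and the smallest point labeled 1."""
    zeros = [x for x, y in labeled if y == 0]
    ones = [x for x, y in labeled if y == 1]
    return (max(zeros) + min(ones)) / 2


def active_learning(pool, seed_set, budget):
    """Label only the points the current model is least sure about."""
    labeled = [(x, oracle(x)) for x in seed_set]
    unlabeled = [x for x in pool if x not in seed_set]
    for _ in range(budget):
        threshold = fit_threshold(labeled)
        # Query step: pick the unlabeled point closest to the decision boundary.
        query = min(unlabeled, key=lambda x: abs(x - threshold))
        unlabeled.remove(query)
        # Labeling step: ask the oracle and grow the training set.
        labeled.append((query, oracle(query)))
    return fit_threshold(labeled), labeled


pool = [i / 100 for i in range(101)]
threshold, labeled = active_learning(pool, seed_set=[0.0, 1.0], budget=8)
print(f"labels used: {len(labeled)} of {len(pool)}")
```

In this toy setting the loop separates all 101 pool points after labeling only 10 of them, which is the kind of saving active learning aims for when labels are expensive.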

Features

  • Provides unified interfaces for Active Learning so that you can easily mix and match query strategies with classifiers provided by sklearn, PyTorch, or transformers.
  • Supports GPU-based PyTorch models and integrates transformers so that you can use state-of-the-art Text Classification models for Active Learning.
  • GPU is supported but not required. For CPU-only use, a lightweight installation requires only a minimal set of dependencies.
  • Multiple scientifically evaluated components are pre-implemented and ready to use (Query Strategies, Initialization Strategies, and Stopping Criteria).
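
As a rough illustration of what "mix and match" means for query strategies, the sketch below treats a strategy as a function that scores each pool instance from the classifier's predicted class probabilities. The function names and the `query` helper are hypothetical and are not Small-Text's interfaces; least confidence and prediction entropy are standard uncertainty measures used here as examples.

```python
import math


def least_confidence(probs):
    """Score = 1 - highest class probability; higher means more uncertain."""
    return [1.0 - max(p) for p in probs]


def prediction_entropy(probs):
    """Score = Shannon entropy of the predicted class distribution."""
    return [-sum(q * math.log(q) for q in p if q > 0) for p in probs]


def query(score_fn, probs, num_samples):
    """Return the indices of the num_samples highest-scoring instances.
    Any scoring function with the same signature can be plugged in."""
    scores = score_fn(probs)
    ranked = sorted(range(len(probs)), key=lambda i: scores[i], reverse=True)
    return ranked[:num_samples]


# Hypothetical predicted probabilities for four unlabeled texts, three classes.
probs = [
    [0.98, 0.01, 0.01],
    [0.40, 0.35, 0.25],
    [0.70, 0.20, 0.10],
    [0.50, 0.49, 0.01],
]

print(query(least_confidence, probs, 2))    # [1, 3]
print(query(prediction_entropy, probs, 2))  # [1, 2]
```

The two strategies agree on the most uncertain instance but differ on the second, which is why being able to swap strategies without touching the classifier is useful in experiments.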

News

  • The small-text paper was awarded Best System Demonstration at EACL 2023 🎉

  • Version 1.3.0 (v1.3.0): Highlights - February 20th, 2023

  • Version 1.2.0 (v1.2.0): Highlights - February 4th, 2023

    • Make huggingface/setfit (SetFit) usable as a small-text classifier.
    • New query strategy: BALD.
    • Added two new SetFit notebooks, and also updated existing notebooks.
  • Version 1.1.1 (v1.1.1) - October 14, 2022

    • Fixes model selection which could raise an error under certain circumstances (#21).

For a complete list of changes, see the change log.

Installation

Small-Text can be easily installed via pip:

pip install small-text

For a full installation include the transformers extra requirement:

pip install small-text[transformers]

Small-Text requires Python 3.7 or newer. To use the GPU, CUDA 10.1 or newer is required. More information regarding the installation can be found in the documentation.
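
The following sketch checks an environment against the requirements above using only the standard library. The helper names are hypothetical. The import names `small_text`, `torch`, and `transformers` are the usual module names for these packages. The CUDA version is not checked here.

```python
import importlib.util
import sys

MIN_PYTHON = (3, 7)  # minimum version stated in the installation notes


def python_is_supported(version_info=sys.version_info):
    """True if the given interpreter version meets the minimum requirement."""
    return tuple(version_info[:2]) >= MIN_PYTHON


def is_installed(module_name):
    """True if a top-level module can be located, without importing it."""
    return importlib.util.find_spec(module_name) is not None


def describe_environment():
    """Summarize which of the relevant packages are available."""
    return {
        "python_ok": python_is_supported(),
        "small_text": is_installed("small_text"),
        "torch": is_installed("torch"),
        "transformers": is_installed("transformers"),
    }


if __name__ == "__main__":
    for name, available in describe_environment().items():
        print(f"{name}: {available}")
```

If `torch` and `transformers` are reported as missing, only the lightweight CPU-oriented installation is present.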

Quick Start

For a quick start, see the provided examples for binary classification, PyTorch multi-class classification, and transformer-based multi-class classification, or check out the notebooks.

Notebooks

#  Notebook
1  Intro: Active Learning for Text Classification with Small-Text
2  Using Stopping Criteria for Active Learning
3  Active Learning using SetFit
4  Using SetFit's Zero Shot Capabilities for Cold Start Initialization

Showcase

A full list of showcases can be found in the docs.

🎀 Would you like to share your use case? Whether it is a paper, an experiment, a practical application, a thesis, a dataset, or something else, let us know and we will add you to the showcase section or even here.

Documentation

Read the latest documentation here.

Alternatives

modAL, ALiPy, libact

Contribution

Contributions are welcome. Details can be found in CONTRIBUTING.md.

Acknowledgments

This software was created by Christopher Schröder (@chschroeder) at Leipzig University's NLP group, which is part of the Webis research network. The encompassing project was funded by the Development Bank of Saxony (SAB) under project number 100335729.

Citation

Small-Text is introduced in detail in the EACL 2023 System Demonstration paper "Small-Text: Active Learning for Text Classification in Python", which can be cited as follows:

@inproceedings{schroeder2023small-text,
    title = "Small-Text: Active Learning for Text Classification in Python",
    author = {Schr{\"o}der, Christopher  and  M{\"u}ller, Lydia  and  Niekler, Andreas  and  Potthast, Martin},
    booktitle = "Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations",
    month = may,
    year = "2023",
    address = "Dubrovnik, Croatia",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.eacl-demo.11",
    pages = "84--95"
}

License

MIT License

More Repositories

1

summary-explorer

Summary Explorer is a tool to visually explore the state-of-the-art in text summarization.
CSS
43
star
2

ECIR-2015-and-SEMEVAL-2015

The experiment software underlying two papers published at ECIR-2015 and SEMEVAL-2015.
Java
37
star
3

summary-workbench

Framework for unified summarisation and evaluation of English documents using state-of-the-art models and measures.
Python
31
star
4

ecir21-an-empirical-comparison-of-web-page-segmentation-algorithms

JavaScript
26
star
5

wasp

Java
25
star
6

archive-query-log

📜 The Archive Query Log.
Jupyter Notebook
22
star
7

acl22-identifying-the-human-values-behind-arguments

Machine Learning scripts for the identification of human values behind arguments.
Python
22
star
8

ir_axioms

↕️ Intuitive axiomatic retrieval experimentation.
Python
22
star
9

ACL-22

16
star
10

ACL-18

Java
15
star
11

webis-tldr-17-corpus

Code for constructing TLDR corpus from Reddit dataset
Python
15
star
12

lightning-ir

Python
14
star
13

cikm20-web-page-segmentation-revisited-evaluation-framework-and-dataset

Code for "Web Page Segmentation Revisited: Evaluation Framework and Dataset", accepted as resources paper to CIKM 2020
HTML
13
star
14

set-encoder

Jupyter Notebook
13
star
15

mturk-manager

An alternative front end for Amazon Mechanical Turk
Vue
12
star
16

scidata22-stereo-scientific-text-reuse

Go
11
star
17

msmarco-llm-distillation

Python
11
star
18

ijcai24-manipulating-embeddings-stable-diffusion

Code for the paper "Manipulating Embeddings of Stable Diffusion Prompts".
Python
10
star
19

DADT

Implementation of Disjoint Author-Document Topic Model
Python
9
star
20

webis-de.github.io

The Webis Group Website.
HTML
8
star
21

corpus-viewer

Python
8
star
22

coling22-benchmark-for-causal-question-answering

Jupyter Notebook
8
star
23

unmasking

General-purpose Unmasking Framework
Python
8
star
24

waka

Construct and author knowledge graphs from text.
Python
7
star
25

ML4CD-21

Code repository for "BERTian Poetics: Constrained Composition with Masked LMs"
Jupyter Notebook
6
star
26

SIGIR-17

Java
6
star
27

ecir24-seo-spam-in-search-engines

Jupyter Notebook
6
star
28

ECIR-24

6
star
29

argmining-21-keypoint-analysis-sharedtask-code

The code for our submission to the key point analysis shared task (2021)
Jupyter Notebook
5
star
30

lecture.js

Lecture.js converts a script and slides to a spoken video presentation using advanced text-to-speech services
JavaScript
5
star
31

acl22-clickbait-spoiling

Jupyter Notebook
5
star
32

ARGMINING-17

The repository for the paper, Unit Segmentation of Argumentative Texts. In ArgMining 2017
Python
5
star
33

scriptor

Plug-and-play reproducible web analysis.
JavaScript
5
star
34

wat

Web Annotation Tool
Java
4
star
35

acl22-revisiting-uncertainty-based-query-strategies-for-active-learning-with-transformers

Revisiting Uncertainty-based Query Strategies for Active Learning with Transformers
Python
4
star
36

mastodon-search

🕸️ A Corpus for Simulating Search on Mastodon.
Jupyter Notebook
4
star
37

eacl21-belief-based-claim-generation

Jupyter Notebook
4
star
38

NLPCSS-20

The repository of the NLPCSS 2020 paper
3
star
39

acl20-target-inference-in-conclusion-generation

Python
3
star
40

ACL-20

Central repository of all ACL'20 publications by the Webis group.
3
star
41

downloads

The downloads directory for the webis.de web page. History will be deleted irregularly.
Python
3
star
42

ecir22-anchor-text

Code and Data for the paper on anchor text for MS Marco.
Jupyter Notebook
3
star
43

webis-web-archiver

Source code and scripts for the Webis Web Archiver
Java
3
star
44

ACL-19

ACL 2019 Code and Data
Python
3
star
45

webis-de-archive

Splash page for archive.webis.de
CSS
3
star
46

emnlp21-same-sentiment

EMNLP 2021 - Casting the Same Sentiment Classification Problem
Jupyter Notebook
3
star
47

SIGIR-19

Repository for the SIGIR'19 paper "Argument Search: Assessing Argument Relevance."
Jupyter Notebook
3
star
48

natural-language-processing-exercises

Python
3
star
49

ecir24-simulating-follow-up-questions

Python
3
star
50

GenIRSim

JavaScript
3
star
51

EMNLP-23

3
star
52

sommercamp

🏕️ Building a search engine from scratch.
Python
3
star
53

IJCAI-21

Code for the paper "Bias Silhouette Analysis: Towards Assessing the Quality of Bias Metrics for Word Embedding Models".
Python
2
star
54

ICWSM-17

Java
2
star
55

acl21-counter-argument-generation-by-attacking-weak-premises

Jupyter Notebook
2
star
56

COLING-20

HTML
2
star
57

slidehub

Generic code for slidehub pages
HTML
2
star
58

aitools4-aq-web-page-content-extraction

Java
2
star
59

ECIR-23

Roff
2
star
60

acl21-ArgKG-argument-generation

2
star
61

ArgMining-20

2
star
62

aitools4-aq-geolocation

Java
2
star
63

pytorch-window-matmul

a custom CUDA kernel for windowed matrix multiplication
Python
2
star
64

COLING-22

2
star
65

QPP-23

Jupyter Notebook
2
star
66

argmining19-same-side-classification

The Benchmarking Workshop
Jupyter Notebook
2
star
67

ecir22-query-obfuscation-game

HTML
2
star
68

ACL-23

2
star
69

password-generation-rules

Java
2
star
70

argmining20-social-bias-argumentation

Code for the paper "Argument from Old Man’s View: Assessing Social Bias in Argumentation".
Python
2
star
71

sigir20-sampling-bias-due-to-near-duplicates-in-learning-to-rank

Sampling Bias Due to Near-Duplicates in Learning to Rank
Kotlin
2
star
72

ECIR-19

Python
2
star
73

authorship-threetrain

Implementation of the tri-training algorithm for authorship attribution described in a paper by Qian et al. 2014
Python
2
star
74

acl20-crawling-mailing-lists

Python
2
star
75

emnlp22-social-bias-representation-accuracy

Code and data for the paper "No Word Embedding Model Is Perfect: Evaluating the Representation Accuracy for Social Bias in the Media" published at EMNLP 2022.
Python
2
star
76

ICTIR-22

Repository for the paper "Sparse Pairwise Re-ranking with Pre-trained Transformers" published at ICTIR 2022.
Jupyter Notebook
2
star
77

ecir24-sparse-cross-encoder

Code and models for the ECIR'24 paper 'Investigating the Effects of Sparse Attention on Cross-Encoders'
Jupyter Notebook
2
star
78

emnlp21-same-stance

EMNLP 2021 - On Classifying whether Two Texts are on the Same Side of an Argument
Jupyter Notebook
2
star
79

acl22-moral-debater-a-study-on-the-computational-generation-of-morally-framed-arguments

Jupyter Notebook
2
star
80

FIGLANG-22

Repository containing code for the paper on identification of source domains by contrastive learning
Python
2
star
81

SIGIR-18

The repository for the data in the SIGIR paper "A User Study on Snippet Generation: Text Reuse vs. Paraphrases"
Python
2
star
82

acl21-informative-conclusion-generation

Jupyter Notebook
1
star
83

SPIRE-22

This repository contains the code and the data for our SPIRE'22 paper on unintended train–test leakage with neural retrieval models.
Java
1
star
84

in2writing22-language-models-as-context-sensitive-word-search-engines

Python
1
star
85

tpdl22-visual-web-archive-quality-assessment

Java
1
star
86

webis-de-assets

Generic Webis Website Assets
SCSS
1
star
87

ACL-21

1
star
88

semeval19-hyperpartisan-news-detection-article-cleaner

Code for cleaning the HTML of articles
Java
1
star
89

cikm20-ndcg-negative-relevance-judgements

Code for the CIKM20 Short Paper: "The Impact of Negative Relevance Judgments on NDCG"
Jupyter Notebook
1
star
90

argmining21-frame-identification

1
star
91

targer-api

🗣️ Simple, type-safe access to the TARGER neural argument tagging APIs.
Python
1
star
92

acl19-heuristic-authorship-obfuscation

C++
1
star
93

EACL-23

EACL-23 Code and Data
1
star
94

eacl23-conclusion-based-counter-argument-generation

Jupyter Notebook
1
star
95

EMNLP-22

1
star
96

WWW-20

The repository of the Webconf paper "Abstractive Snippet Generation"
Python
1
star
97

AMOC-21

This repository documents our entry in the 2021 Amoc Hackathon
Jupyter Notebook
1
star
98

SAMESIDE-19

SameSideClassification Source Code (Fork ASV)
Jupyter Notebook
1
star
99

argmining20-rhetorical-devices

Java
1
star
100

koppel14

Tries to implement the algorithms in the paper 'Determining if two Documents are written by the same author' by Koppel and Winter from 2014
Jupyter Notebook
1
star