  • Stars: 531
  • Rank: 82,927 (Top 2 %)
  • Language: Python
  • License: MIT License
  • Created over 3 years ago
  • Updated 3 months ago

Repository Details

Active Learning for Text Classification in Python

Small-Text provides state-of-the-art Active Learning for Text Classification. Several pre-implemented Query Strategies, Initialization Strategies, and Stopping Criteria are provided, which can be easily mixed and matched to build active learning experiments or applications.

What is Active Learning?
Active Learning allows you to efficiently label training data in a small data scenario.
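
As a rough illustration of the idea (this is not Small-Text's API), the sketch below runs pool-based active learning with uncertainty sampling on a toy one-dimensional problem. The `oracle` function, the threshold classifier, and the decision boundary at 0.62 are all assumptions made up for this example.

```python
def oracle(x):
    """Stand-in for a human annotator (hypothetical labeling rule)."""
    return 1 if x >= 0.62 else 0


def fit_threshold(labeled):
    """Toy classifier: the boundary is the midpoint between both classes."""
    max_negative = max(x for x, y in labeled.items() if y == 0)
    min_positive = min(x for x, y in labeled.items() if y == 1)
    return (max_negative + min_positive) / 2


def query_most_uncertain(pool, threshold):
    """Uncertainty sampling: pick the unlabeled point closest to the boundary."""
    return min(pool, key=lambda x: abs(x - threshold))


def active_learning(pool, seed_points, num_queries):
    """Label the seed points, then repeatedly query, label, and refit."""
    labeled = {x: oracle(x) for x in seed_points}
    unlabeled = [x for x in pool if x not in labeled]
    for _ in range(num_queries):
        threshold = fit_threshold(labeled)
        x = query_most_uncertain(unlabeled, threshold)
        labeled[x] = oracle(x)  # the only place where labeling effort is spent
        unlabeled.remove(x)
    return fit_threshold(labeled), len(labeled)


# 1,001 unlabeled points; the seed set must contain one example per class.
pool = [i / 1000 for i in range(1001)]
threshold, num_labels = active_learning(pool, seed_points=[0.0, 1.0], num_queries=10)
print(f"estimated boundary {threshold:.4f} using {num_labels} labels")
```

In this toy setting, 12 labels out of 1,001 candidates are enough to place the boundary within a few thousandths of the true value, because each query is spent where the current model is least certain.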

Features

  • Provides unified interfaces for Active Learning so that you can easily mix and match query strategies with classifiers provided by sklearn, PyTorch, or transformers.
  • Supports GPU-based PyTorch models and integrates transformers so that you can use state-of-the-art Text Classification models for Active Learning.
  • GPU is supported but not required. In case of a CPU-only use case, a lightweight installation only requires a minimal set of dependencies.
  • Multiple scientifically evaluated components are pre-implemented and ready to use (Query Strategies, Initialization Strategies, and Stopping Criteria).
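
To give a feel for what "mix and match" can mean in practice, here is a minimal sketch of interchangeable query strategies behind one shared call signature. It assumes only that a classifier exposes class probabilities per unlabeled item; the function names and the example probabilities are hypothetical and do not reproduce Small-Text's actual interfaces.

```python
import random

# Shared (assumed) signature: strategy(probas, n) -> indices of items to label next.


def random_sampling(probas, n, seed=0):
    """Baseline: choose n items uniformly at random."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(len(probas)), n))


def least_confidence(probas, n):
    """Choose the items whose most likely class has the lowest probability."""
    order = sorted(range(len(probas)), key=lambda i: max(probas[i]))
    return order[:n]


def smallest_margin(probas, n):
    """Choose the items with the smallest gap between the top two classes."""
    def margin(p):
        first, second = sorted(p, reverse=True)[:2]
        return first - second

    order = sorted(range(len(probas)), key=lambda i: margin(probas[i]))
    return order[:n]


# Predicted class probabilities for four unlabeled texts (made-up numbers).
probas = [
    [0.50, 0.49, 0.01],
    [0.40, 0.30, 0.30],
    [0.90, 0.05, 0.05],
    [0.60, 0.25, 0.15],
]

# Any strategy can be swapped in without touching the surrounding loop.
for strategy in (random_sampling, least_confidence, smallest_margin):
    print(strategy.__name__, strategy(probas, 2))
```

Because every strategy consumes the same input and returns indices, the classifier producing `probas` can be replaced independently of the strategy that ranks the items.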

News

  • The small-text paper was awarded Best System Demonstration at EACL 2023 🎉

  • Version 1.3.0 (v1.3.0): Highlights - February 20th, 2023

  • Version 1.2.0 (v1.2.0): Highlights - February 4th, 2023

    • Make huggingface/setfit (SetFit) usable as a small-text classifier.
    • New query strategy: BALD.
    • Added two new SetFit notebooks, and also updated existing notebooks.
  • Version 1.1.1 (v1.1.1) - October 14, 2022

    • Fixes model selection which could raise an error under certain circumstances (#21).

For a complete list of changes, see the change log.

Installation

Small-Text can be easily installed via pip:

pip install small-text

For a full installation include the transformers extra requirement:

pip install small-text[transformers]

It requires Python 3.7 or newer. For using the GPU, CUDA 10.1 or newer is required. More information regarding the installation can be found in the documentation.
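
After installing, a quick sanity check along the following lines can confirm that the environment meets the stated Python requirement and show which optional packages are importable. This is only a sketch: `check_environment` is a hypothetical helper, and it assumes that the full installation makes the `torch` and `transformers` modules available alongside `small_text`.

```python
import importlib.util
import sys


def check_environment(min_python=(3, 7)):
    """Report whether the interpreter and the relevant packages are available."""
    def installed(module_name):
        # find_spec returns None when a top-level module cannot be found.
        return importlib.util.find_spec(module_name) is not None

    return {
        "python_ok": sys.version_info >= min_python,
        "small_text": installed("small_text"),
        "torch": installed("torch"),  # assumed to come with the full installation
        "transformers": installed("transformers"),
    }


for name, available in check_environment().items():
    print(f"{name}: {'yes' if available else 'no'}")
```

A CPU-only setup would be expected to report `small_text` as available while `torch` and `transformers` may be missing.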

Quick Start

For a quick start, see the provided examples for binary classification, PyTorch multi-class classification, and transformer-based multi-class classification, or check out the notebooks.

Notebooks

# Notebook
1 Intro: Active Learning for Text Classification with Small-Text
2 Using Stopping Criteria for Active Learning
3 Active Learning using SetFit
4 Using SetFit's Zero Shot Capabilities for Cold Start Initialization

Showcase

A full list of showcases can be found in the docs.

🎀 Would you like to share your use case? Whether it is a paper, an experiment, a practical application, a thesis, a dataset, or something else, let us know and we will add you to the showcase section or even here.

Documentation

Read the latest documentation here. Noteworthy pages include:

Alternatives

modAL, ALiPy, libact

Contribution

Contributions are welcome. Details can be found in CONTRIBUTING.md.

Acknowledgments

This software was created by Christopher Schröder (@chschroeder) at Leipzig University's NLP group, which is part of the Webis research network. The encompassing project was funded by the Development Bank of Saxony (SAB) under project number 100335729.

Citation

Small-Text has been introduced in detail in the EACL23 System Demonstration Paper "Small-Text: Active Learning for Text Classification in Python" which can be cited as follows:

@inproceedings{schroeder2023small-text,
    title = "Small-Text: Active Learning for Text Classification in Python",
    author = {Schr{\"o}der, Christopher  and  M{\"u}ller, Lydia  and  Niekler, Andreas  and  Potthast, Martin},
    booktitle = "Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations",
    month = may,
    year = "2023",
    address = "Dubrovnik, Croatia",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.eacl-demo.11",
    pages = "84--95"
}

License

MIT License

More Repositories

1

summary-explorer

Summary Explorer is a tool to visually explore the state-of-the-art in text summarization.
CSS
43
star
2

ECIR-2015-and-SEMEVAL-2015

The experiment software underlying two papers published at ECIR-2015 and SEMEVAL-2015.
Java
37
star
3

summary-workbench

Framework for unified summarisation and evaluation of English documents using state-of-the-art models and measures.
Python
31
star
4

ecir21-an-empirical-comparison-of-web-page-segmentation-algorithms

JavaScript
26
star
5

wasp

Java
25
star
6

archive-query-log

📜 The Archive Query Log.
Jupyter Notebook
22
star
7

acl22-identifying-the-human-values-behind-arguments

Machine Learning scripts for the identification of human values behind arguments.
Python
22
star
8

ir_axioms

↕️ Intuitive axiomatic retrieval experimentation.
Python
22
star
9

ACL-22

15
star
10

ACL-18

Java
15
star
11

webis-tldr-17-corpus

Code for constructing the TLDR corpus from the Reddit dataset
Python
15
star
12

cikm20-web-page-segmentation-revisited-evaluation-framework-and-dataset

Code for "Web Page Segmentation Revisited: Evaluation Framework and Dataset", accepted as a resource paper at CIKM 2020
HTML
13
star
13

mturk-manager

An alternative front end for Amazon Mechanical Turk
Vue
12
star
14

scidata22-stereo-scientific-text-reuse

Go
10
star
15

msmarco-llm-distillation

Python
10
star
16

lightning-ir

Python
10
star
17

DADT

Implementation of Disjoint Author-Document Topic Model
Python
9
star
18

set-encoder

Jupyter Notebook
9
star
19

ijcai24-manipulating-embeddings-stable-diffusion

Code for the paper "Manipulating Embeddings of Stable Diffusion Prompts".
Python
8
star
20

webis-de.github.io

The Webis Group Website.
HTML
8
star
21

corpus-viewer

Python
8
star
22

coling22-benchmark-for-causal-question-answering

Jupyter Notebook
8
star
23

unmasking

General-purpose Unmasking Framework
Python
8
star
24

ML4CD-21

Code repository for "BERTian Poetics: Constrained Composition with Masked LMs"
Jupyter Notebook
6
star
25

SIGIR-17

Java
6
star
26

scriptor

Plug-and-play reproducible web analysis.
JavaScript
6
star
27

ECIR-24

6
star
28

argmining-21-keypoint-analysis-sharedtask-code

The code for our submission to the key point analysis shared task (2021)
Jupyter Notebook
5
star
29

lecture.js

Lecture.js converts a script and slides to a spoken video presentation using advanced text-to-speech services
JavaScript
5
star
30

ARGMINING-17

The repository for the paper "Unit Segmentation of Argumentative Texts" (ArgMining 2017)
Python
5
star
31

waka

Construct and author knowledge graphs from text.
Python
5
star
32

ecir24-seo-spam-in-search-engines

Jupyter Notebook
5
star
33

wat

Web Annotation Tool
Java
4
star
34

acl22-revisiting-uncertainty-based-query-strategies-for-active-learning-with-transformers

Revisiting Uncertainty-based Query Strategies for Active Learning with Transformers
Python
4
star
35

acl22-clickbait-spoiling

Jupyter Notebook
4
star
36

eacl21-belief-based-claim-generation

Jupyter Notebook
4
star
37

NLPCSS-20

The repository of the NLPCSS 2020 paper
3
star
38

acl20-target-inference-in-conclusion-generation

Python
3
star
39

ACL-20

Central repository of all ACL'20 publications by the Webis group.
3
star
40

downloads

The downloads directory for the webis.de web page. History will be deleted irregularly.
Python
3
star
41

ecir22-anchor-text

Code and Data for the paper on anchor text for MS Marco.
Jupyter Notebook
3
star
42

webis-web-archiver

Source code and scripts for the Webis Web Archiver
Java
3
star
43

ACL-19

ACL 2019 Code and Data
Python
3
star
44

mastodon-search

🕸️ A Corpus for Simulating Search on Mastodon.
Jupyter Notebook
3
star
45

webis-de-archive

Splash page for archive.webis.de
CSS
3
star
46

emnlp21-same-sentiment

EMNLP 2021 - Casting the Same Sentiment Classification Problem
Jupyter Notebook
3
star
47

SIGIR-19

Repository for the SIGIR'19 paper "Argument Search: Assessing Argument Relevance."
Jupyter Notebook
3
star
48

natural-language-processing-exercises

Python
3
star
49

EMNLP-23

3
star
50

IJCAI-21

Code for the paper "Bias Silhouette Analysis: Towards Assessing the Quality of Bias Metrics for Word Embedding Models".
Python
2
star
51

COLING-20

HTML
2
star
52

ICWSM-17

Java
2
star
53

acl21-counter-argument-generation-by-attacking-weak-premises

Jupyter Notebook
2
star
54

slidehub

Generic code for slidehub pages
HTML
2
star
55

aitools4-aq-web-page-content-extraction

Java
2
star
56

ECIR-23

Roff
2
star
57

acl21-ArgKG-argument-generation

2
star
58

ArgMining-20

2
star
59

aitools4-aq-geolocation

Java
2
star
60

pytorch-window-matmul

A custom CUDA kernel for windowed matrix multiplication
Python
2
star
61

COLING-22

2
star
62

QPP-23

Jupyter Notebook
2
star
63

argmining19-same-side-classification

The Benchmarking Workshop
Jupyter Notebook
2
star
64

ecir22-query-obfuscation-game

HTML
2
star
65

ACL-23

2
star
66

password-generation-rules

Java
2
star
67

argmining20-social-bias-argumentation

Code for the paper "Argument from Old Man’s View: Assessing Social Bias in Argumentation".
Python
2
star
68

sigir20-sampling-bias-due-to-near-duplicates-in-learning-to-rank

Sampling Bias Due to Near-Duplicates in Learning to Rank
Kotlin
2
star
69

ECIR-19

Python
2
star
70

authorship-threetrain

Implementation of the tri-training algorithm for authorship attribution described in a paper by Qian et al. 2014
Python
2
star
71

acl20-crawling-mailing-lists

Python
2
star
72

ecir24-sparse-cross-encoder

Code and models for the ECIR'24 paper 'Investigating the Effects of Sparse Attention on Cross-Encoders'
Jupyter Notebook
2
star
73

ecir24-simulating-follow-up-questions

Python
2
star
74

ICTIR-22

Repository for the paper "Sparse Pairwise Re-ranking with Pre-trained Transformers" published at ICTIR 2022.
Jupyter Notebook
2
star
75

emnlp21-same-stance

EMNLP 2021 - On Classifying whether Two Texts are on the Same Side of an Argument
Jupyter Notebook
2
star
76

acl22-moral-debater-a-study-on-the-computational-generation-of-morally-framed-arguments

Jupyter Notebook
2
star
77

SIGIR-18

The repository for the data in the SIGIR paper "A User Study on Snippet Generation: Text Reuse vs. Paraphrases"
Python
2
star
78

acl21-informative-conclusion-generation

Jupyter Notebook
1
star
79

SPIRE-22

This repository contains the code and the data for our SPIRE'22 paper on unintended train-test leakage with neural retrieval models.
Java
1
star
80

in2writing22-language-models-as-context-sensitive-word-search-engines

Python
1
star
81

tpdl22-visual-web-archive-quality-assessment

Java
1
star
82

webis-de-assets

Generic Webis Website Assets
SCSS
1
star
83

ACL-21

1
star
84

semeval19-hyperpartisan-news-detection-article-cleaner

Code for cleaning the HTML of articles
Java
1
star
85

cikm20-ndcg-negative-relevance-judgements

Code for the CIKM20 Short Paper: "The Impact of Negative Relevance Judgments on NDCG"
Jupyter Notebook
1
star
86

argmining21-frame-identification

1
star
87

targer-api

🗣️ Simple, type-safe access to the TARGER neural argument tagging APIs.
Python
1
star
88

acl19-heuristic-authorship-obfuscation

C++
1
star
89

EACL-23

EACL-23 Code and Data
1
star
90

eacl23-conclusion-based-counter-argument-generation

Jupyter Notebook
1
star
91

EMNLP-22

1
star
92

WWW-20

The repository of the Webconf paper "Abstractive Snippet Generation"
Python
1
star
93

AMOC-21

This repository documents our entry in the 2021 Amoc Hackathon
Jupyter Notebook
1
star
94

SAMESIDE-19

SameSideClassification Source Code (Fork ASV)
Jupyter Notebook
1
star
95

argmining20-rhetorical-devices

Java
1
star
96

koppel14

Tries to implement the algorithms in the paper 'Determining if two Documents are written by the same author' by Koppel and Winter from 2014
Jupyter Notebook
1
star
97

bea24-essay-feedback-generation

Python
1
star
98

naacl24-school-student-essay-corpus

Jupyter Notebook
1
star
99

RENEUIR-24

The participation of the FSU team at ReNeuIR 2024.
Jupyter Notebook
1
star
100

WWW-24

The repository of the WWW'2024 paper "Detecting Generated Native Ads in Conversational Search"
1
star