  • Stars: 289
  • Rank: 143,419 (Top 3 %)
  • Language: Jupyter Notebook
  • License: MIT License
  • Created over 2 years ago
  • Updated about 1 year ago

Repository Details

Gain clues from clustering!

cluestar

Gain a clue by clustering!

This library contains visualisation tools that might help you get started with classification tasks. The idea is that if you can inspect clusters easily, you might gain a clue on what good labels for your dataset might be!

It generates charts that look like this:

Install

python -m pip install cluestar

Interactive Demo

You can see an interactive demo of the generated widgets here.

You can also toy around with the demo notebook found here.

Usage

The first step is to encode your text data in two dimensions, as shown below.

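from sklearn.pipeline import make_pipeline
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# TF-IDF turns each text into a sparse vector; SVD reduces it to two dimensions.
pipe = make_pipeline(TfidfVectorizer(), TruncatedSVD(n_components=2))

# `texts` is assumed to be a list of strings; X has shape (len(texts), 2).
X = pipe.fit_transform(texts)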

From here you can make an interactive chart with:

from cluestar import plot_text

plot_text(X, texts)

The best results are likely found when you use UMAP together with something like the Universal Sentence Encoder.

You can also make the chart easier to read by highlighting the points whose text contains certain words.

plot_text(X, texts, color_words=["plastic", "voucher", "deliver"])
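
To get a feel for which points such a word list would pick out, here is a minimal sketch in plain Python. It assumes a simple case-insensitive substring match, which may differ from how cluestar matches words internally. The example `texts` and the helper `contains_any` are made up for illustration and are not part of the library.

```python
def contains_any(text, words):
    """Return True if any of the words occurs in the text (case-insensitive substring match)."""
    lowered = text.lower()
    return any(word.lower() in lowered for word in words)


# Hypothetical example texts, not taken from any real dataset.
texts = [
    "My voucher code does not work",
    "The parcel was delivered to the wrong address",
    "How do I reset my password?",
]
color_words = ["plastic", "voucher", "deliver"]

# One boolean per text: True means the text mentions at least one of the words.
highlighted = [contains_any(t, color_words) for t in texts]
print(highlighted)  # [True, True, False]
```

Under this assumption "delivered" counts as a hit for "deliver", so word stems can be enough to catch several variants.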

You can also pass a numeric array, such as the predicted probabilities from a model, to control the color.

# First, get an array of probabilities from some model
# (assumes a scikit-learn style classifier that offers `predict_proba`).
p_vals = some_model.predict_proba(texts)[:, 0]
# Use these to assign pretty colors.
plot_text(X, texts, color_array=p_vals)
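
The `[:, 0]` indexing above selects a single column from a two-dimensional array of class probabilities, leaving one number per text. A minimal sketch of that step in plain Python, using made-up probabilities in place of a real model's output (the values and the name `proba` are illustrative assumptions):

```python
# Hypothetical class probabilities for three texts and two classes (one row per text).
proba = [
    [0.9, 0.1],
    [0.2, 0.8],
    [0.5, 0.5],
]

# Equivalent of `proba[:, 0]` on a NumPy array: keep the first class's probability per text.
p_vals = [row[0] for row in proba]

# Sanity check: exactly one value per text, so every point can be given a color.
assert len(p_vals) == len(proba)
print(p_vals)  # [0.9, 0.2, 0.5]
```

If you care about a different class, pick the matching column index instead of `0`.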
