• This repository has been archived on 07/Feb/2023
  • Stars: 468
  • Rank: 93,767 (Top 2 %)
  • Language: Python
  • License: Apache License 2.0
  • Created over 4 years ago
  • Updated almost 2 years ago

Repository Details

Toolkit to help understand "what lies" in word embeddings. Also benchmarking!

Archival Notice

This was a fun project for a while, but it's become a pain to maintain all the different backends. If you're looking for visualisation tools, check https://github.com/koaning/cluestar and consider https://github.com/koaning/embetter if you're interested in the embeddings going forward.

whatlies

A library that tries to help you understand (note the pun):

"What lies in word embeddings?"

This small library offers tools that make it easier to visualise both word embeddings and operations on them.

Produced

This project was initiated at Rasa as a by-product of our efforts in the developer advocacy and research teams. The project is maintained by koaning in order to support more use-cases.

Features

This library has tools to help you understand what lies in word embeddings. This includes:

  • simple tools to create (interactive) visualisations
  • support for many language backends including spaCy, fasttext, tfhub, huggingface and bpemb
  • lightweight scikit-learn featurizer support for all these backends

Installation

You can install the package via pip:

pip install whatlies

This will install the base dependencies. Depending on the transformers and language backends that you'll be using, you may want to install more. Here are some of the possible installation settings you could go for.

pip install "whatlies[spacy]"
pip install "whatlies[tfhub]"
pip install "whatlies[transformers]"

If you want it all, you can also install via:

pip install "whatlies[all]"

Note that this will install dependencies but it will not install all the language models you might want to visualise. For example, you might still need to manually download spaCy models if you intend to use that backend.

Getting Started

More in-depth getting started guides can be found on the documentation page (https://koaning.github.io/whatlies/).

Examples

The idea is that you can load embeddings from a language backend and apply mathematical operations to them.

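from whatlies import EmbeddingSet
from whatlies.language import SpacyLanguage

# The spaCy model needs to be downloaded before it can be loaded here.
lang = SpacyLanguage("en_core_web_md")
words = ["cat", "dog", "fish", "kitten", "man", "woman",
         "king", "queen", "doctor", "nurse"]

# Look up one embedding per word and collect them in a set.
emb = EmbeddingSet(*[lang[w] for w in words])
# Use the "man" and "woman" embeddings as the two chart axes.
emb.plot_interactive(x_axis=emb["man"], y_axis=emb["woman"])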
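One way to read such a chart is that each word is placed according to how strongly its vector points along each axis embedding. The snippet below is a minimal, dependency-free sketch of that idea. It assumes the coordinate is a normalised scalar projection onto the axis vector, which may differ in detail from what the library does internally. The function `axis_coordinate` and the toy 2D vectors are hypothetical and only serve as illustration.

```python
def dot(a, b):
    """Dot product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))


def axis_coordinate(vector, axis):
    """Scalar projection of `vector` onto `axis`, scaled so the axis itself maps to 1."""
    return dot(vector, axis) / dot(axis, axis)


# Toy 2D "embeddings" (made up for illustration, not real model output).
toy = {
    "man": [1.0, 0.0],
    "woman": [0.0, 1.0],
    "king": [0.75, 0.25],
}

# Each word gets an (x, y) position relative to the "man" and "woman" axes.
coords = {
    word: (axis_coordinate(vec, toy["man"]), axis_coordinate(vec, toy["woman"]))
    for word, vec in toy.items()
}
```

In this toy setup the axis words land at 1 on their own axis, and every other word lands wherever its vector leans.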

You can even do fancy operations, like projecting onto and away from vector embeddings! These operations work on single embeddings as well as on sets of embeddings. In the example below we attempt to filter away gender bias using linear algebra operations.

orig_chart = emb.plot_interactive('man', 'woman')

# `|` projects every embedding in the set away from the king - queen direction.
new_ts = emb | (emb['king'] - emb['queen'])
new_chart = new_ts.plot_interactive('man', 'woman')
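For intuition, projecting a vector away from a direction can be read as removing the component that lies along that direction. The sketch below illustrates that linear algebra on toy 2D vectors without any dependencies. It assumes the operation behaves like an orthogonal rejection, and the helpers `project_onto` and `project_away` are hypothetical names, not part of the whatlies API.

```python
def dot(a, b):
    """Dot product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))


def project_onto(vector, direction):
    """Component of `vector` that lies along `direction`."""
    scale = dot(vector, direction) / dot(direction, direction)
    return [scale * d for d in direction]


def project_away(vector, direction):
    """What is left of `vector` after removing its component along `direction`."""
    along = project_onto(vector, direction)
    return [v - a for v, a in zip(vector, along)]


# Toy vectors, made up for illustration.
direction = [1.0, 0.0]
vector = [3.0, 4.0]

remainder = project_away(vector, direction)
```

After the projection the remainder has no component left along `direction`, which is the sense in which a "gender direction" such as king minus queen would be filtered out of each embedding.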

There are also transformers for things like PCA and UMAP.

from whatlies.transformers import Pca, Umap

orig_chart = emb.plot_interactive('man', 'woman')
# Reduce the embeddings to two dimensions before plotting.
pca_plot = emb.transform(Pca(2)).plot_interactive()
umap_plot = emb.transform(Umap(2)).plot_interactive()

# Show both charts side by side.
pca_plot | umap_plot

Scikit-Learn Support

Every language backend in this library is available as a scikit-learn featurizer as well.

import numpy as np
from whatlies.language import BytePairLanguage
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    ("embed", BytePairLanguage("en")),
    ("model", LogisticRegression())
])

X = [
    "i really like this post",
    "thanks for that comment",
    "i enjoy this friendly forum",
    "this is a bad post",
    "i dislike this article",
    "this is not well written"
]

y = np.array([1, 1, 1, 0, 0, 0])

pipe.fit(X, y)

Documentation

To learn more and for a getting started guide, check out the documentation.

Similar Projects

There are some similar projects out there and we figured it fair to mention and compare them here.

Julia Bazińska & Piotr Migdal Web App

The original inspiration for this project came from this web app and this pydata talk. It is a web app that takes a while to load but it is really fun to play with. The goal of this project is to make it easier to make similar charts from jupyter using different language backends.

Tensorflow Projector

From Google there's the TensorFlow Projector project. It offers highly interactive 3d visualisations as well as some transformations via TensorBoard.

  • The TensorFlow Projector creates projections in TensorBoard, which you can also load into a Jupyter notebook, whereas whatlies makes visualisations directly.
  • The TensorFlow Projector supports interactive 3d visuals, which whatlies currently doesn't.
  • Whatlies offers lego bricks that you can chain together to get a visualisation started. This also means that you're more flexible when it comes to transforming data before visualising it.

Parallax

From Uber AI Labs there's parallax which is described in a paper here. There's a common mindset in the two tools; the goal is to use arbitrary user defined projections to understand embedding spaces better. That said, there are some differences worth mentioning.

  • It relies on bokeh as a visualisation backend and offers a lot of visualisation types (like radar plots). Whatlies uses altair and tries to stick to simple scatter charts. Altair can export interactive html/svg but it will not scale as well if you're drawing many points at the same time.
  • Parallax is meant to be run as a stand-alone app from the command line while Whatlies is meant to be run from the jupyter notebook.
  • Parallax gives a full user interface while Whatlies offers lego bricks that you can chain together to get a visualisation started.
  • Whatlies relies on language backends (like spaCy, huggingface) to fetch word embeddings. Parallax allows you to instead fetch raw files on disk.
  • Parallax has been around for a while; Whatlies is newer and therefore more experimental.

Local Development

If you want to develop locally you can start by running this command.

make develop

Documentation

The documentation is generated via:

make docs

Citation

Please use the following citation if you find whatlies helpful for any of your work (the whatlies paper is available at https://www.aclweb.org/anthology/2020.nlposs-1.8):

@inproceedings{warmerdam-etal-2020-going,
    title = "Going Beyond {T}-{SNE}: Exposing whatlies in Text Embeddings",
    author = "Warmerdam, Vincent  and
      Kober, Thomas  and
      Tatman, Rachael",
    booktitle = "Proceedings of Second Workshop for NLP Open Source Software (NLP-OSS)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.nlposs-1.8",
    doi = "10.18653/v1/2020.nlposs-1.8",
    pages = "52--60",
    abstract = "We introduce whatlies, an open source toolkit for visually inspecting word and sentence embeddings. The project offers a unified and extensible API with current support for a range of popular embedding backends including spaCy, tfhub, huggingface transformers, gensim, fastText and BytePair embeddings. The package combines a domain specific language for vector arithmetic with visualisation tools that make exploring word embeddings more intuitive and concise. It offers support for many popular dimensionality reduction techniques as well as many interactive visualisations that can either be statically exported or shared via Jupyter notebooks. The project documentation is available from https://koaning.github.io/whatlies/.",
}
