  • Stars: 350
  • Rank 117,032 (Top 3 %)
  • Language: JavaScript
  • License: MIT License
  • Created about 1 year ago
  • Updated 2 months ago

Repository Details

A scientific instrument for investigating latent spaces

Latent Scope

Quickly embed, project, cluster and explore a dataset. This project is a new kind of workflow + tool for visualizing and exploring datasets through the lens of latent spaces.

The power of machine learning models to encode unstructured data into high-dimensional embeddings is relatively under-explored. Retrieval Augmented Generation has taken off as a popular use case for embeddings, but do you feel confident in your understanding of why certain data is being retrieved? Do you have a clear picture of everything that is in your dataset? Latent Scope is like a microscope that gives you a new perspective on what happens to your data when it's embedded. You can try similarity search with different embeddings, peruse automatically labeled clusters and zoom in on individual data points, all while keeping the context of your entire dataset.

Demo

This tool is meant to be run locally or on a trusted server to process data for viewing in the latent scope. You can see the result of the process in a read-only live demo:

The source of each demo dataset is documented in the notebooks linked below. Each demo was chosen to represent a different scale of data as well as some common use cases.

Dadabase demo scopes

Quick Start

To get started, install the latentscope package and run the server from the command line:

python -m venv venv
source venv/bin/activate
pip install latentscope
ls-init ~/local-scope-data --openai_key=XXX --mistral_key=YYY # optional api keys to enable API models 
ls-serve 

Then open your browser to http://localhost:5001 and start processing your first dataset!

Once ingested, you will go through the following 6 steps: Embed, UMAP, Cluster, Label, Scope and Explore.

Each step focuses on the choices needed to move you to the next step, for example which embedding model to use or which parameters to pass to UMAP. You will likely want to try several choices at each step, which is why the final step before "Explore" is to make a "scope". You can make multiple scopes, as seen in the dadabase example, to explore your data through different lenses (e.g. OpenAI embeddings vs. Jina v2).

Python interface

You can also ingest data from a Pandas dataframe using the Python interface:

import pandas as pd
from latentscope import ls

df = pd.read_parquet("...")
ls.init("~/latent-scope-data") # you can also pass in openai_key="XXX", mistral_key="XXX", etc.
ls.ingest("dadabase", df, text_column="joke")
ls.serve()

See these notebooks for detailed examples of using the Python interface to prepare and load data.

  • dvs-survey - A small test dataset of 700 rows to quickly illustrate the process. This notebook shows how you can do every step of the process with the Python interface.
  • dadabase - A more interesting (and funny) dataset of 50k rows. This notebook shows how you can preprocess a dataset, ingest it into latentscope and then use the web interface to complete the process.
  • dolly15k - Grab data from HuggingFace datasets and ingest into the process.
  • emotion - 400k rows of emotional tweets.

Command line quick start

When latent-scope is installed, it creates a suite of command line scripts that can be used to set up the scopes for exploring in the web application. The output of each step in the process is a set of flat files stored in the data directory specified at init. These files are in standard formats designed to be ported into other pipelines or interfaces.

# like above, we make sure to install latentscope
python -m venv venv
source venv/bin/activate
pip install latentscope

# prepare some data
wget -O ~/Downloads/datavis-misunderstood.csv "https://storage.googleapis.com/fun-data/latent-scope/examples/dvs-survey/datavis-misunderstood.csv"

ls-init "~/latent-scope-data"
# ls-ingest-csv dataset_id csv_path
ls-ingest-csv "datavis-misunderstood" "~/Downloads/datavis-misunderstood.csv"
# get a list of model ids available (lists both embedding and chat models available)
ls-list-models
# ls-embed dataset_id text_column model_id prefix
ls-embed datavis-misunderstood "answer" transformers-intfloat___e5-small-v2 ""
# ls-umap dataset_id embedding_id n_neighbors min_dist
ls-umap datavis-misunderstood embedding-001 25 .1
# ls-cluster dataset_id umap_id samples min_samples
ls-cluster datavis-misunderstood umap-001 5 5
# ls-label dataset_id text_column cluster_id model_id context
ls-label datavis-misunderstood "answer" cluster-001 transformers-HuggingFaceH4___zephyr-7b-beta ""
# ls-scope  dataset_id embedding_id umap_id cluster_id cluster_labels_id label description
ls-scope datavis-misunderstood embedding-001 umap-001 cluster-001 cluster-001-labels-001 "E5 demo" "E5 embeddings summarized by Zephyr 7B"
# start the server to explore your scope
ls-serve
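
The same sequence can also be driven from a script. The sketch below only assembles the commands shown above as argument lists and prints them; it executes them only if you turn off the dry run. The helper names `build_pipeline` and `run_pipeline` are made up for this illustration and are not part of latent-scope. The ids (`embedding-001`, `umap-001`, `cluster-001`) assume a freshly ingested dataset, and the `ls-scope` arguments follow the order given in its usage comment.

```python
import shlex
import subprocess


def build_pipeline(dataset_id, csv_path, text_column,
                   embed_model, label_model, data_dir="~/latent-scope-data"):
    """Return the CLI steps above as a list of argv lists, in run order."""
    return [
        ["ls-init", data_dir],
        ["ls-ingest-csv", dataset_id, csv_path],
        # the trailing "" is the empty prefix argument from the usage comment
        ["ls-embed", dataset_id, text_column, embed_model, ""],
        ["ls-umap", dataset_id, "embedding-001", "25", ".1"],
        ["ls-cluster", dataset_id, "umap-001", "5", "5"],
        # the trailing "" is the empty context argument
        ["ls-label", dataset_id, text_column, "cluster-001", label_model, ""],
        ["ls-scope", dataset_id, "embedding-001", "umap-001", "cluster-001",
         "cluster-001-labels-001", "E5 demo",
         "E5 embeddings summarized by Zephyr 7B"],
    ]


def run_pipeline(steps, dry_run=True):
    """Print each step; execute it only when dry_run is False."""
    for argv in steps:
        print(shlex.join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)  # stop at the first failing step


steps = build_pipeline(
    "datavis-misunderstood",
    "~/Downloads/datavis-misunderstood.csv",
    "answer",
    "transformers-intfloat___e5-small-v2",
    "transformers-HuggingFaceH4___zephyr-7b-beta",
)
run_pipeline(steps)  # dry run: prints the commands without executing them
```

Paths are passed through literally, just as in the quoted shell commands above, so `~` is not expanded by this script.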

Repository overview

This repository is currently meant to run locally, with a React frontend that communicates with a Python server backend. We support several popular open source embedding models that can run locally as well as proprietary API embedding services. Adding new models and services should be quick and easy.

To learn more about customizing, extending and contributing, see DEVELOPMENT.md.

Design principles

This tool is meant to be a part of a larger process, something that hopefully helps you see things in your data that you wouldn't otherwise have seen. That means it needs to be easy to get data in and easy to get useful data out.

  1. Flat files
  • All of the data that drives the app is stored in flat files. This is so that both final and intermediate outputs can easily be exported for other uses. It also makes it easy to see the status of any part of the process.
  2. Remember everything
  • This tool is intended to aid research; its purpose is experimentation and exploration. I developed it because far too often I try a lot of things and then forget which parameters led me down a promising path in the first place. All choices you make in the process are recorded in metadata files along with the output of the process.
  3. It's all about the indices
  • We consider an input dataset the source of truth, a list of rows that can be indexed into. So all downstream operations, whether embedding, pointing to nearest neighbors or assigning data points to clusters, use indices into the input dataset.
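
To make the third principle concrete, here is a minimal sketch in plain Python. It is not code from latent-scope: the rows, the toy 2-D vectors and the function names are invented for illustration. The point is only that nearest-neighbor results and cluster assignments are lists of integer positions into the input rows rather than copies of the rows.

```python
import math

# The input dataset is the source of truth: a list of rows addressed by index.
rows = [
    "a joke about cats",
    "a joke about dogs",
    "a pun about math",
    "a pun about physics",
]

# Toy 2-D "embeddings"; entry i belongs to rows[i].
embeddings = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]


def cosine(a, b):
    """Cosine similarity between two 2-D vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def nearest_neighbors(query_index, k=1):
    """Return the indices (not the rows) of the k most similar other rows."""
    scored = [
        (cosine(embeddings[query_index], vec), i)
        for i, vec in enumerate(embeddings)
        if i != query_index
    ]
    scored.sort(reverse=True)
    return [i for _, i in scored[:k]]


# Cluster assignments are likewise stored as indices into the dataset.
clusters = {0: [0, 1], 1: [2, 3]}

neighbor_ids = nearest_neighbors(0, k=1)
print([rows[i] for i in neighbor_ids])   # look the rows up only when needed
print([rows[i] for i in clusters[1]])
```

Because every artifact refers back to the same positions, any intermediate output can be joined to the original data with a simple lookup.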

Command Line Scripts: Detailed description

If you want to use the CLI instead of the web UI you can use the following scripts.

The scripts should be run in order once you have an input.csv file in your folder. Alternatively, the Setup page in the web UI will run these scripts for you via API calls to the server.
These scripts expect at least a LATENT_SCOPE_DATA environment variable with the path where you want to store your data. If you run ls-serve, it will set the variable and put it in a .env file. You can add API keys to the .env file to enable the various API services; see .env.example for the structure.
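
As an illustration of how a script might locate the data directory, the sketch below checks the process environment first and then falls back to a `.env` file. This is an assumption about a reasonable lookup order, not a description of what latent-scope itself does; the helper names are hypothetical, and `OPENAI_API_KEY` is only a placeholder key name (see .env.example for the real ones).

```python
import os
import tempfile
from pathlib import Path


def read_env_file(path):
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def resolve_data_dir(env_path=".env"):
    """Prefer the process environment, then fall back to the .env file."""
    value = os.environ.get("LATENT_SCOPE_DATA")
    if value is None and Path(env_path).exists():
        value = read_env_file(env_path).get("LATENT_SCOPE_DATA")
    if value is None:
        raise RuntimeError("LATENT_SCOPE_DATA is not set")
    return Path(value).expanduser()


# Demo against a throwaway .env file.
with tempfile.TemporaryDirectory() as tmp:
    env_path = Path(tmp) / ".env"
    env_path.write_text(
        'LATENT_SCOPE_DATA="~/latent-scope-data"\n'
        "OPENAI_API_KEY=XXX\n"  # hypothetical key name
    )
    os.environ.pop("LATENT_SCOPE_DATA", None)  # force the demo to read the file
    data_dir = resolve_data_dir(env_path)
    print(data_dir)
```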

0. ingest

This script turns the input.csv into input.parquet and sets up the directories and meta.json which run the app.

# ls-ingest <dataset_name>
ls-ingest database-curated

1. embed

Take the text from the input and embed it. The default is to use BAAI/bge-small-en-v1.5 locally via HuggingFace transformers. API services are supported as well; see latentscope/models/embedding_models.json for model ids.

# you can get a list of models available with:
ls-list-models
# ls-embed <dataset_name> <text_column> <model_id>
ls-embed dadabase joke transformers-intfloat___e5-small-v2

2. umap

Map the embeddings from high-dimensional space to 2D with UMAP. This step also generates a thumbnail of the scatterplot.

# ls-umap <dataset_name> <embedding_id> <neighbors> <min_dist>
ls-umap dadabase embedding-001 50 0.1

3. cluster

Cluster the UMAP points using HDBSCAN. This assigns a cluster label to each point.

# ls-cluster <dataset_name> <umap_id> <samples> <min_samples>
ls-cluster dadabase umap-001 5 3

4. label

We support auto-labeling clusters by summarizing them with an LLM. Supported models and APIs are listed in latentscope/models/chat_models.json. You can pass context that will be injected into the system prompt for your dataset.

# ls-label <dataset_id> <text_column> <cluster_id> <chat_model_id> <context>
ls-label dadabase "joke" cluster-001 openai-gpt-3.5-turbo ""

5. scope

The scope command ties together each step of the process to create an explorable configuration. You can have several scopes to view different choices, for example using different embeddings or even different parameters for UMAP and clustering. Switching between scopes in the UI is instant.

# ls-scope  <dataset_id> <embedding_id> <umap_id> <cluster_id> <cluster_labels_id> <label> <description>
ls-scope datavis-misunderstood embedding-001 umap-001 cluster-001 cluster-001-labels-001 "E5 demo" "E5 embeddings summarized by GPT3.5-Turbo"
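
One way to picture a scope is as a small record that names one output from each earlier step. The sketch below models that with a dataclass whose fields mirror the `ls-scope` usage comment. The actual JSON schema written by the tool may differ, and the `-002` ids are hypothetical results of re-running the earlier steps with other settings.

```python
import json
from dataclasses import asdict, dataclass, replace


@dataclass
class Scope:
    # Field names mirror the ls-scope arguments; the real schema may differ.
    dataset_id: str
    embedding_id: str
    umap_id: str
    cluster_id: str
    cluster_labels_id: str
    label: str
    description: str


e5_scope = Scope(
    dataset_id="datavis-misunderstood",
    embedding_id="embedding-001",
    umap_id="umap-001",
    cluster_id="cluster-001",
    cluster_labels_id="cluster-001-labels-001",
    label="E5 demo",
    description="E5 embeddings summarized by GPT3.5-Turbo",
)

# A second scope over the same dataset, pointing at different upstream choices.
alt_scope = replace(
    e5_scope,
    embedding_id="embedding-002",
    umap_id="umap-002",
    cluster_id="cluster-002",
    cluster_labels_id="cluster-002-labels-001",
    label="Alternative demo",
    description="Same data, different embedding model",
)

print(json.dumps(asdict(e5_scope), indent=2))
```

Since each scope only stores ids, keeping several of them around costs almost nothing and no upstream output is duplicated.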

6. serve

To start the web UI we run a small server. This also enables nearest neighbor similarity search and interactively querying subsets of the input data while exploring the scopes.

ls-serve ~/latent-scope-data

Dataset directory structure

Each dataset will have its own directory in data/ created when you ingest your CSV. All subsequent steps of setting up a dataset write their data and metadata to this directory. There are no databases in this tool, just flat files that are easy to copy and edit.

├── data/
|   ├── dataset1/
|   |   ├── input.parquet                           # from ingest.py, the dataset
|   |   ├── meta.json                               # from ingest.py, metadata for dataset, #rows, columns, text_column
|   |   ├── embeddings/
|   |   |   ├── embedding-001.h5                    # from embed.py, embedding vectors
|   |   |   ├── embedding-001.json                  # from embed.py, parameters used to embed
|   |   |   ├── embedding-002...                   
|   |   ├── umaps/
|   |   |   ├── umap-001.parquet                    # from umap.py, x,y coordinates
|   |   |   ├── umap-001.json                       # from umap.py, params used
|   |   |   ├── umap-001.png                        # from umap.py, thumbnail of plot
|   |   |   ├── umap-002....                        
|   |   ├── clusters/
|   |   |   ├── clusters-001.parquet                # from cluster.py, cluster indices
|   |   |   ├── clusters-001-labels-default.parquet # from cluster.py, default labels
|   |   |   ├── clusters-001-labels-001.parquet     # from label_clusters.py, LLM generated labels
|   |   |   ├── clusters-001.json                   # from cluster.py, params used
|   |   |   ├── clusters-001.png                    # from cluster.py, thumbnail of plot
|   |   |   ├── clusters-002...                     
|   |   ├── scopes/
|   |   |   ├── scopes-001.json                     # from scope.py, combination of embed, umap, clusters and label choice
|   |   |   ├── scopes-...                      
|   |   ├── tags/
|   |   |   ├── ❤️.indices                           # tagged by UI, powered by tags.py
|   |   |   ├── ...                                 # can have arbitrary named tags
|   |   ├── jobs/
|   |   |   ├── 8980-12345...json                   # created when job is run via web UI
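
The numbered file names above suggest a simple convention: each run of a step gets the next free id and writes its parameters to a JSON file beside its output. The sketch below shows one way such a convention could be implemented with the standard library. It is an assumption for illustration, not latent-scope's own code; `next_id` and `record_run` are hypothetical helpers and the JSON keys are made up.

```python
import json
import re
import tempfile
from pathlib import Path


def next_id(directory, prefix):
    """Return the next zero-padded id, e.g. umap-003 after umap-002."""
    pattern = re.compile(rf"^{re.escape(prefix)}-(\d{{3}})\.json$")
    used = [
        int(m.group(1))
        for p in Path(directory).glob(f"{prefix}-*.json")
        if (m := pattern.match(p.name))
    ]
    return f"{prefix}-{max(used, default=0) + 1:03d}"


def record_run(directory, prefix, params):
    """Write <id>.json with the parameters used for this run."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    run_id = next_id(directory, prefix)
    payload = {"id": run_id, **params}
    (directory / f"{run_id}.json").write_text(json.dumps(payload, indent=2))
    return run_id


# Demo in a temporary directory so nothing touches a real data folder.
with tempfile.TemporaryDirectory() as tmp:
    umaps = Path(tmp) / "dataset1" / "umaps"
    first = record_run(umaps, "umap", {"embedding_id": "embedding-001",
                                       "neighbors": 25, "min_dist": 0.1})
    second = record_run(umaps, "umap", {"embedding_id": "embedding-001",
                                        "neighbors": 50, "min_dist": 0.1})
    saved = json.loads((umaps / f"{second}.json").read_text())
    print(first, second, saved["neighbors"])
```

Keeping the parameters next to the output is what lets you later recover which settings produced a given projection or clustering.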
