Exploration of methods for coloring t-SNE.

Coloring t-SNE

t-SNE is great at capturing a combination of the local and global structure of a dataset in 2d or 3d. But when plotting points in 2d, there are often interesting patterns in the data that only come out as "texture" in the point cloud. When the plot is colored appropriately, these patterns can be made more clear.

%matplotlib inline
import matplotlib.pyplot as plt
from sklearn.cluster import MiniBatchKMeans
from sklearn.decomposition import IncrementalPCA, FastICA
from sklearn.neighbors import NearestNeighbors
from scipy.spatial.distance import euclidean
import numpy as np

First we load some data: a 128 dimensional embedding output from a VAE, and a 2 dimensional representation of those vectors based on running t-SNE.

%time data128 = np.load('data128.npy')
print(data128.shape)
%time data2 = np.load('data2.npy')
print(data2.shape)
CPU times: user 528 µs, sys: 353 ms, total: 354 ms
Wall time: 485 ms
(358359, 128)
CPU times: user 442 µs, sys: 4.1 ms, total: 4.54 ms
Wall time: 4.74 ms
(358359, 2)

Raw Data

When we visualize the raw data itself the "texture" is clear, but not as clear as the different "islands" and clusters.

def plot_tsne(xy, colors=None, alpha=0.25, figsize=(6,6), s=0.5, cmap='hsv'):
    plt.figure(figsize=figsize, facecolor='white')
    plt.margins(0)
    plt.axis('off')
    fig = plt.scatter(xy[:,0], xy[:,1],
                c=colors, # set colors of markers
                cmap=cmap, # set color map of markers
                alpha=alpha, # set alpha of markers
                marker=',', # use smallest available marker (square)
                s=s, # set marker size. single pixel is 0.5 on retina, 1.0 otherwise
                lw=0, # don't use edges
                edgecolor='none') # don't use edges
    # remove all axes and whitespace / borders
    fig.axes.get_xaxis().set_visible(False)
    fig.axes.get_yaxis().set_visible(False)
    plt.show()
    
plot_tsne(data2)

png

One way of pulling out a real feature from the data is to look at the nearest neighbors in 2d and see how far away they are on average in the original high dimensional space. This suggestion comes from Martin Wattenberg. First we compute the indices for all the nearest neighbors:

nns = NearestNeighbors(n_neighbors=10).fit(data2)
%time distances, indices = nns.kneighbors(data2)
CPU times: user 1.73 s, sys: 44.1 ms, total: 1.77 s
Wall time: 1.77 s

And then we compute the distances in high dimensional space, and normalize them between 0 and 1.

distances = []
for point, neighbor_indices in zip(data128, indices):
    neighbor_points = data128[neighbor_indices[1:]] # skip the first one, which should be itself
    cur_distances = np.sum([euclidean(point, neighbor) for neighbor in neighbor_points])
    distances.append(cur_distances)
distances = np.asarray(distances)
distances -= distances.min()
distances /= distances.max()
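
The loop above calls `euclidean` once per neighbor pair, which can be slow for hundreds of thousands of points. A minimal vectorized sketch of the same computation follows, assuming `indices` has the self-match in column 0 as above. The names `neighbor_distance_sums`, `normalize01`, and the `chunk` parameter are hypothetical and not part of the original notebook.

```python
import numpy as np

def neighbor_distance_sums(high, indices, chunk=10000):
    """Sum of high-dimensional distances from each point to its 2d neighbors.

    high:    (n, d) array of high-dimensional vectors
    indices: (n, k) neighbor indices found in 2d; column 0 is assumed to be the point itself
    chunk:   number of rows processed per step, to bound memory use
    """
    n = high.shape[0]
    sums = np.empty(n, dtype=float)
    for start in range(0, n, chunk):
        stop = min(start + chunk, n)
        # (rows, k-1, d) array of neighbor vectors for this chunk
        neighbors = high[indices[start:stop, 1:]]
        diffs = neighbors - high[start:stop, None, :]
        sums[start:stop] = np.linalg.norm(diffs, axis=2).sum(axis=1)
    return sums

def normalize01(values):
    """Rescale to [0, 1], matching the min/max normalization above."""
    values = values - values.min()
    return values / values.max()
```

With the arrays from this notebook, usage would look like `distances = normalize01(neighbor_distance_sums(data128, indices))`.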

In this case the distances look roughly Gaussian with a long tail. We clip both ends to draw out the details in the colors.

plt.hist(np.clip(distances, 0.2, 0.4), bins=50)
plt.show()

png

plot_tsne(data2, np.clip(distances, 0.2, 0.4), cmap='viridis')

png

3D t-SNE

One technique is to compute t-SNE in 3D and use the results as colors. This can take a long time to compute with large datasets. This is the technique that was used for the Infinite Drum Machine.

from bhtsne import tsne
data3 = tsne(data128, dimensions=3)
data3 -= np.min(data3, axis=0)
data3 /= np.max(data3, axis=0)
plot_tsne(data2, data3)
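
The min/max rescaling used here reappears for the PCA and ICA projections later on. A small helper could capture it. The name `to_unit_range` is hypothetical, and this is only a sketch of the same two-line normalization, plus a guard for constant columns that the original code does not have.

```python
import numpy as np

def to_unit_range(projection):
    """Rescale each column to [0, 1] so an (n, 3) array can be used as RGB colors."""
    out = np.asarray(projection, dtype=float).copy()
    out -= out.min(axis=0)
    span = out.max(axis=0)
    span[span == 0] = 1.0  # avoid dividing by zero for constant columns
    return out / span
```

For example, `plot_tsne(data2, to_unit_range(data3))` would be equivalent to the cell above.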

3D/24D PCA

Another approach is to use PCA, which is much faster but does not show as much structure in the data.

pca = IncrementalPCA(n_components=3)
%time pca_projection = pca.fit_transform(data128)
pca_projection -= np.min(pca_projection, axis=0)
pca_projection /= np.max(pca_projection, axis=0)
plot_tsne(data2, pca_projection)
CPU times: user 10 s, sys: 2.34 s, total: 12.4 s
Wall time: 9.36 s

png

Instead of using PCA to 3D, we can also do PCA to 24 dimensions, compare each dimension to its median, and use those comparisons as bits in a 24-bit color. This suggestion comes from Mario Klingemann. This technique can work well with only 12 dimensions (4 bits per channel). It doesn't make sense in a "continuous" space (normalizing and multiplying the shuffled values by the basis directly, rather than testing against the median first): that just makes the colors all muddled, more similar to the 3D PCA.

def projection_to_colors(projection, bits_per_channel=8):
    basis = 2**np.arange(bits_per_channel)[::-1]
    basis = np.hstack([basis, basis, basis])
    shuffled = np.hstack([projection[:,0::3], projection[:,1::3], projection[:,2::3]])
    bits = (shuffled > np.median(shuffled, axis=0)) * basis
    # if we stacked into a 3d tensor we could do this a little more efficiently
    colors = np.vstack([bits[:,:(bits_per_channel)].sum(axis=1),
                        bits[:,(bits_per_channel):(2*bits_per_channel)].sum(axis=1),
                        bits[:,(2*bits_per_channel):(3*bits_per_channel)].sum(axis=1)]).astype(float).T
    return colors / (2**bits_per_channel - 1)
    
def pack_binary_pca(data, bits_per_channel=8):
    bits_per_color = 3 * bits_per_channel
    pca = IncrementalPCA(n_components=bits_per_color)
    pca_projection = pca.fit_transform(data)
    return projection_to_colors(pca_projection, bits_per_channel)

%time colors = pack_binary_pca(data128, 8)
plot_tsne(data2, colors)
CPU times: user 11.3 s, sys: 2.59 s, total: 13.9 s
Wall time: 10.6 s
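
To make the bit packing concrete, here is a small sketch of how one row of eight median comparisons becomes a single channel value, using the same most-significant-bit-first basis as `projection_to_colors`. The comparison values are made up for illustration.

```python
import numpy as np

bits_per_channel = 8
# most significant bit first: [128, 64, 32, 16, 8, 4, 2, 1]
basis = 2 ** np.arange(bits_per_channel)[::-1]

# hypothetical result of (component > median) for one point, one channel
above_median = np.array([True, False, False, False, False, False, False, True])

# sum the selected bit values and scale into [0, 1]
channel = (above_median * basis).sum() / (2 ** bits_per_channel - 1)
print(channel)  # 129 / 255
```

Because `projection_to_colors` hands the first three components the most significant bit of the red, green, and blue channels, the earliest PCA components have the largest influence on the final color.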

png

3D/24D ICA

Another approach is to use ICA, which can be a little slower than PCA, but shows different features depending on the data.

ica = FastICA(n_components=3)
%time ica_projection = ica.fit_transform(data128)
ica_projection -= np.min(ica_projection, axis=0)
ica_projection /= np.max(ica_projection, axis=0)
plot_tsne(data2, ica_projection)
CPU times: user 31.3 s, sys: 932 ms, total: 32.3 s
Wall time: 15.2 s

png

We can also do ICA to 24 dimensions and pack it into colors. This might make less sense than PCA theoretically.

def pack_binary_ica(data, bits_per_channel=8):
    bits_per_color = 3 * bits_per_channel
    ica = FastICA(n_components=bits_per_color, max_iter=500)
    ica_projection = ica.fit_transform(data)
    return projection_to_colors(ica_projection, bits_per_channel)

%time colors = pack_binary_ica(data128, 8)
plot_tsne(data2, colors)
CPU times: user 2min 15s, sys: 7.01 s, total: 2min 22s
Wall time: 1min 43s

png

K-Means

Another approach that shows up in the LargeVis Paper is to compute K-Means on the high dimensional data and then use those labels as color indices. We can try with 8, 30, and 128 cluster K-Means.

kmeans = MiniBatchKMeans(n_clusters=8)
%time labels = kmeans.fit_predict(data128)
plot_tsne(data2, labels)
CPU times: user 1.1 s, sys: 12.5 ms, total: 1.12 s
Wall time: 1.12 s

png

kmeans = MiniBatchKMeans(n_clusters=30)
%time labels = kmeans.fit_predict(data128)
plot_tsne(data2, labels)
CPU times: user 3.98 s, sys: 25.4 ms, total: 4.01 s
Wall time: 4.02 s

png

kmeans = MiniBatchKMeans(n_clusters=128)
%time labels = kmeans.fit_predict(data128)
plot_tsne(data2, labels)
CPU times: user 10.5 s, sys: 50.3 ms, total: 10.6 s
Wall time: 10.6 s

png

Some of the "boundaries" between colors seem fairly arbitrary, but K-Means has a nice property of allowing us to identify the centers of these color regions if we want to provide exemplars.

neighbors = NearestNeighbors(n_neighbors=1, metric='euclidean')
%time neighbors.fit(data128)
%time distances, indices = neighbors.kneighbors(kmeans.cluster_centers_)

plt.figure(figsize=(6,6), facecolor='white')
plt.margins(0)
plt.axis('off')
fig = plt.scatter(data2[:,0], data2[:,1], alpha=0.5, marker=',', s=0.5, lw=0, edgecolor='none', c=labels, cmap='hsv')
centers = indices.ravel() # kneighbors returns shape (n_clusters, 1), flatten to 1d
plt.scatter(data2[centers,0], data2[centers,1], marker='.', s=250, color='black') # dark outline
plt.scatter(data2[centers,0], data2[centers,1], marker='.', s=100, c=labels[centers], cmap='hsv')
plt.show()
CPU times: user 6.64 s, sys: 28.7 ms, total: 6.66 s
Wall time: 6.67 s
CPU times: user 13.9 s, sys: 60.2 ms, total: 14 s
Wall time: 14.1 s

png

sklearn.neighbors can be slow, but the mrpt library is much faster.

import mrpt
data128f32 = data128.astype(np.float32)
%time nn = mrpt.MRPTIndex(data128f32, depth=5, n_trees=100)
%time nn.build()
def kneighbors(nn, queries, k, votes_required=4):
    return np.asarray([nn.ann(query, k, votes_required=votes_required) for query in queries])
%time indices = kneighbors(nn, data128f32[:100], 10)

argmax

Another technique, arguably the most evocative, is to use the argmax of each high dimensional vector. The motivation for using argmax is that high dimensional data is so sparse that "nearby" points should have a similar ordering of their dimensions: if you sorted the dimensions of two nearby points, the difference should be small. This means that their argmax (the largest dimension) should probably be shared. If we do this without any modification to the high dimensional data, we get a fairly homogeneous plot:

plot_tsne(data2, np.argmax(data128, axis=1))

png

This is because a few dimensions dominate the argmax.

plt.hist(np.argmax(data128, axis=1), bins=128)
plt.show()

png

If we standardize each dimension then there is a more even distribution of possible argmax values, and therefore a more even distribution of colors.

def standardize(data):
    std = np.copy(data)
    std -= std.mean(axis=0)
    std /= std.std(axis=0)
    return std
data128_standardized = standardize(data128)
plt.hist(np.argmax(data128_standardized, axis=1), bins=128)
plt.show()

png

plot_tsne(data2, np.argmax(data128_standardized, axis=1))

png

The high dimensional data here has both negative and positive components, so it might make more sense to take the absolute value before computing the argmax. For this dataset, though, it makes things visually "messier" with too many overlapping colors.

plot_tsne(data2, np.argmax(np.abs(data128_standardized), axis=1))

png

PCA + argmax

Because some of the dimensions are correlated with each other in this case, it might make sense to do PCA before taking the argmax. Again if we take the argmax without standardizing the high dimensional data, a few colors dominate.

pca = IncrementalPCA(n_components=30)
%time pca_projection = pca.fit_transform(data128)
labels = np.argmax(pca_projection, axis=1)
plot_tsne(data2, labels)
CPU times: user 10.9 s, sys: 2.21 s, total: 13.1 s
Wall time: 9.44 s

png

Here we can see that the distribution of argmax results is concentrated toward the first dimensions of the PCA projection.

plt.hist(np.argmax(pca_projection, axis=1), bins=30)
plt.show()

png

Once we standardize it we see a more even distribution.

projection_standardized = standardize(pca_projection)
plt.hist(np.argmax(projection_standardized, axis=1), bins=30)
plt.show()

png

And now we can take the standardized argmax, and the argmax of the absolute values of the standardized data.

plot_tsne(data2, np.argmax(projection_standardized, axis=1))

png

plot_tsne(data2, np.argmax(np.abs(projection_standardized), axis=1))

png

I prefer the non-absolute value argmax in this case. Now we can run the whole process again for different numbers of output components from PCA. Here it is for 16 and 128 dimensions.

pca = IncrementalPCA(n_components=16)
%time pca_projection = pca.fit_transform(data128)
labels = np.argmax(standardize(pca_projection), axis=1)
plot_tsne(data2, labels)
CPU times: user 10.4 s, sys: 2.14 s, total: 12.6 s
Wall time: 9.18 s

png

pca = IncrementalPCA(n_components=128)
%time pca_projection = pca.fit_transform(data128)
labels = np.argmax(standardize(pca_projection), axis=1)
plot_tsne(data2, labels)
CPU times: user 13.3 s, sys: 2.58 s, total: 15.9 s
Wall time: 10.9 s

png

It should be possible to "tune" the amount of color variation by an element-wise multiplication between each PCA projected vector and a vector with some "falloff" that gives more weight to the earlier dimensions and less weight to the final dimensions.
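
One way to sketch this idea, assuming an exponential falloff (the text does not specify a shape), is to scale the standardized projection by a decaying weight vector before taking the argmax. The function name `falloff_argmax` and the `decay` parameter are hypothetical.

```python
import numpy as np

def falloff_argmax(projection, decay=0.95):
    """Argmax over standardized components, weighted toward earlier components.

    decay=1.0 reproduces the plain standardized argmax; smaller values
    favor the first components and reduce color variation.
    """
    std = projection - projection.mean(axis=0)
    std = std / std.std(axis=0)
    # weights 1, decay, decay**2, ... one per component
    weights = decay ** np.arange(projection.shape[1])
    return np.argmax(std * weights, axis=1)
```

With the arrays above, this could be tried as `plot_tsne(data2, falloff_argmax(pca_projection, decay=0.9))`, adjusting `decay` until the balance of colors looks right.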

ICA + argmax

We can try the same technique, but using ICA instead of PCA. Here for 8, 30, and 128 dimensions.

ica = FastICA(n_components=8, max_iter=500)
%time ica_projection = ica.fit_transform(data128)
labels = np.argmax(standardize(ica_projection), axis=1)
plot_tsne(data2, labels)
CPU times: user 35.1 s, sys: 2 s, total: 37.1 s
Wall time: 19.7 s

png

ica = FastICA(n_components=30, max_iter=500)
%time ica_projection = ica.fit_transform(data128)
labels = np.argmax(standardize(ica_projection), axis=1)
plot_tsne(data2, labels)
CPU times: user 51.6 s, sys: 2.05 s, total: 53.7 s
Wall time: 31.6 s

png

ica = FastICA(n_components=128, max_iter=500)
%time ica_projection = ica.fit_transform(data128)
labels = np.argmax(standardize(ica_projection), axis=1)
plot_tsne(data2, labels)
CPU times: user 4min 59s, sys: 18.3 s, total: 5min 17s
Wall time: 3min

png

ICA/PCA + K-Means

Finally, we can try computing K-Means on top of the dimensionality reduced vectors.

kmeans = MiniBatchKMeans(n_clusters=128)
%time labels = kmeans.fit_predict(pca_projection)
plot_tsne(data2, labels)
CPU times: user 10.3 s, sys: 96 ms, total: 10.4 s
Wall time: 10.4 s

png

kmeans = MiniBatchKMeans(n_clusters=128)
%time labels = kmeans.fit_predict(ica_projection)
plot_tsne(data2, labels)
CPU times: user 8.41 s, sys: 160 ms, total: 8.57 s
Wall time: 8.57 s

png
