  • Stars: 528
  • Rank: 83,941 (Top 2%)
  • Language: JavaScript
  • License: MIT License
  • Created over 7 years ago
  • Updated over 1 year ago

Repository Details

A WebGL viewer for UMAP or TSNE-clustered images

PixPlot

This repository contains code for visualizing tens of thousands of images in a two-dimensional projection in which similar images are clustered together. The image analysis uses TensorFlow's Inception bindings, and the visualization layer uses a custom WebGL viewer.

See the change log for recent updates.

Installation & Dependencies

We maintain several platform-specific installation cookbooks online.

Broadly speaking, to install the Python dependencies, we recommend you install Anaconda and then create a conda environment with a Python 3.7 runtime:

conda create --name=3.7 python=3.7
source activate 3.7

Then you can install the dependencies by running:

pip install https://github.com/yaledhlab/pix-plot/archive/master.zip

The website that PixPlot eventually creates requires a WebGL-enabled browser.

Quickstart

If you have a WebGL-enabled browser and a directory full of images to process, you can prepare the data for the viewer by installing the dependencies above and then running:

pixplot --images "path/to/images/*.jpg"

To see the results of this process, you can start a web server by running:

# for python 3.x
python -m http.server 5000

# for python 2.x
python -m SimpleHTTPServer 5000

The visualization will then be available at http://localhost:5000/output.
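If you would rather start the server from a script, the Python 3 command above can be approximated with the standard library. This is a minimal sketch: the function name make_server and its parameters are illustrative and are not part of PixPlot.

```python
import functools
import http.server


def make_server(directory=".", port=5000):
    """Build (but do not start) a static file server rooted at `directory`."""
    # SimpleHTTPRequestHandler accepts a `directory` keyword in Python 3.7+.
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory
    )
    # Bind to localhost only; pass port=0 to let the OS pick a free port.
    return http.server.ThreadingHTTPServer(("localhost", port), handler)


# Usage (blocks until interrupted with Ctrl+C):
#   server = make_server(".", 5000)
#   server.serve_forever()
```

Run it from the directory that contains the output folder so that the plot is reachable at /output. Unlike python -m http.server, this sketch listens on localhost only.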

Sample Data

To acquire some sample data with which to build a plot, feel free to use some data prepared by Yale's DHLab:

pip install image_datasets

Then in a Python script:

import image_datasets
image_datasets.oslomini.download()

The .download() command will make a directory named datasets in your current working directory. That datasets directory will contain a subdirectory named 'oslomini', which contains a directory of images and another directory with a CSV file of image metadata. Using that data, we can next build a plot:

pixplot --images "datasets/oslomini/images/*" --metadata "datasets/oslomini/metadata/metadata.csv"

Creating Massive Plots

If you need to plot more than 100,000 images but don't have an expensive graphics card with which to visualize huge WebGL displays, you might want to specify a smaller "cell_size" parameter when building your plot. The "cell_size" argument controls how large each image is in the atlas files; smaller values require fewer textures to be rendered, which decreases the GPU RAM required to view a plot:

pixplot --images "path/to/images/*.jpg" --cell_size 10
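To get a rough feel for why a smaller cell size helps, the sketch below counts how many atlas textures a collection would need. It assumes square atlases of 2048 pixels per side and one square cell per image. Both are assumptions made for illustration rather than figures taken from PixPlot, and atlases_needed is a hypothetical helper.

```python
import math


def atlases_needed(n_images, cell_size, atlas_size=2048):
    """Estimate the number of atlas textures required.

    atlas_size=2048 is an assumed value, not a documented PixPlot setting.
    """
    cells_per_side = atlas_size // cell_size  # whole cells that fit on one edge
    cells_per_atlas = cells_per_side ** 2     # cells in one square atlas
    return math.ceil(n_images / cells_per_atlas)


for cell_size in (32, 10):
    print(cell_size, atlases_needed(100_000, cell_size))
```

Under these assumptions, 100,000 images need 25 atlases at a cell size of 32 but only 3 at a cell size of 10.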

Controlling UMAP Layout

The UMAP algorithm is particularly sensitive to three hyperparameters:

--min_dist: determines the minimum distance between points in the embedding
--n_neighbors: determines the tradeoff between local and global clusters
--metric: determines the distance metric to use when positioning points

UMAP's creator, Leland McInnes, has written up a helpful overview of these hyperparameters. To specify the value for one or more of these hyperparameters when building a plot, one may use the flags above, e.g.:

pixplot --images "path/to/images/*.jpg" --n_neighbors 2
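When comparing several hyperparameter settings, it can help to assemble the command in a script. The sketch below only builds and prints the command line. The helper build_pixplot_command is hypothetical, the flag names are the ones listed above, and the values for min_dist and metric are placeholders rather than recommendations.

```python
import shlex


def build_pixplot_command(images, **options):
    """Return the pixplot command as an argument list.

    Keyword names are turned into flags, e.g. n_neighbors=2 -> --n_neighbors 2.
    Options set to None are skipped.
    """
    args = ["pixplot", "--images", images]
    for flag, value in options.items():
        if value is not None:
            args += ["--" + flag, str(value)]
    return args


cmd = build_pixplot_command(
    "path/to/images/*.jpg",
    n_neighbors=2,
    min_dist=0.1,     # placeholder value
    metric="cosine",  # placeholder value
)
# Quote each argument so the glob pattern reaches pixplot unexpanded.
print(" ".join(shlex.quote(arg) for arg in cmd))
```

The returned list can also be passed directly to subprocess.run, which avoids shell quoting altogether.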

Curating Automatic Hotspots

If it is installed and available, PixPlot uses HDBSCAN (hierarchical density-based spatial clustering of applications with noise), a refinement of the earlier DBSCAN algorithm, to find hotspots in the visualization. You may be interested in consulting this explanation of how HDBSCAN works.

Tip: If you are using HDBSCAN and find that PixPlot creates too few 'automatic hotspots' (or only one), try lowering --min_cluster_size from its default of 20. This often happens with smaller datasets (fewer than a few thousand images).

If HDBSCAN is not available, PixPlot will fall back to scikit-learn's implementation of KMeans.
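The fallback described above follows a common optional-dependency pattern, which might look like the sketch below. This is not PixPlot's actual code: pick_clustering_backend is a hypothetical helper that only reports which backend would be chosen, without importing or running either library.

```python
import importlib.util


def pick_clustering_backend(preferred="hdbscan", fallback="kmeans"):
    """Return `preferred` if that module is importable, else `fallback`."""
    # find_spec returns None when a top-level module cannot be found.
    if importlib.util.find_spec(preferred) is not None:
        return preferred
    return fallback


print("clustering backend:", pick_clustering_backend())
```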

Adding Metadata

If you have metadata associated with each of your images, you can pass in that metadata when running the data processing script. Doing so will allow the PixPlot viewer to display the metadata associated with an image when a user clicks on that image.

To specify the metadata for your image collection, you can add --metadata=path/to/metadata.csv to the command you use to call the processing script. For example, you might specify:

pixplot --images "path/to/images/*.jpg" --metadata "path/to/metadata.csv"

Metadata should be in a comma-separated value file, should contain one row for each input image, and should contain headers specifying the column order. Here is a sample metadata file:

filename   category    tags    description     permalink     Year
bees.jpg   yellow      a|b|c   bees' knees     https://...   1776
cats.jpg   dangerous   b|c|d   cats' pajamas   https://...   1972

The following column labels are accepted:

Column        Description
filename      the filename of the image
category      a categorical label for the image
tags          a pipe-delimited list of categorical tags for the image
description   a plaintext description of the image's contents
permalink     a link to the image hosted on another domain
year          a year timestamp for the image (should be an integer)
label         a categorical label used for supervised UMAP projection
lat           the latitudinal position of the image
lng           the longitudinal position of the image
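Before building a plot, it can be worth checking a metadata file against the column list above. The following is a minimal sketch using only the standard library. check_metadata and split_tags are hypothetical helpers, treating filename as required is an assumption, and header names are matched exactly as listed because the text above does not say whether PixPlot ignores case.

```python
import csv
import io

ACCEPTED_COLUMNS = {
    "filename", "category", "tags", "description",
    "permalink", "year", "label", "lat", "lng",
}


def split_tags(value):
    """Split a pipe-delimited tag string into a list of tags."""
    return [tag for tag in value.split("|") if tag]


def check_metadata(csv_text):
    """Return a list of human-readable problems found in the CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    headers = reader.fieldnames or []
    problems = []
    unknown = [h for h in headers if h not in ACCEPTED_COLUMNS]
    if unknown:
        problems.append("unrecognized columns: " + ", ".join(unknown))
    if "filename" not in headers:  # assumption: filename is required
        problems.append("missing column: filename")
    # Data rows start on line 2 because line 1 holds the headers.
    for line_no, row in enumerate(reader, start=2):
        year = (row.get("year") or "").strip()
        if year:
            try:
                int(year)
            except ValueError:
                problems.append(f"line {line_no}: year {year!r} is not an integer")
    return problems


sample = (
    "filename,category,tags,description,permalink,year\n"
    "bees.jpg,yellow,a|b|c,bees' knees,https://...,1776\n"
    "cats.jpg,dangerous,b|c|d,cats' pajamas,https://...,1972\n"
)
print(check_metadata(sample))
print(split_tags("a|b|c"))
```

To check a file on disk, read it with open(path, newline="", encoding="utf-8") and pass the text to check_metadata.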

IIIF Images

If you would like to process images that are hosted on a IIIF server, you can specify a newline-delimited list of IIIF image manifests as the --images argument. For example, the following could be saved as manifest.txt:

https://manifests.britishart.yale.edu/manifest/40005
https://manifests.britishart.yale.edu/manifest/40006
https://manifests.britishart.yale.edu/manifest/40007
https://manifests.britishart.yale.edu/manifest/40008
https://manifests.britishart.yale.edu/manifest/40009

One could then specify these images as input by running:

pixplot --images manifest.txt --n_clusters 2
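A manifest list like the one above is easy to sanity-check before a long processing run. The sketch below parses newline-delimited text and rejects lines that are not http(s) URLs. parse_manifest_list is a hypothetical helper, and skipping blank lines is a choice made for this sketch, not documented PixPlot behavior.

```python
from urllib.parse import urlparse


def parse_manifest_list(text):
    """Return the manifest URLs found in newline-delimited text."""
    urls = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        url = line.strip()
        if not url:
            continue  # skip blank lines (a choice made for this sketch)
        parts = urlparse(url)
        if parts.scheme not in ("http", "https") or not parts.netloc:
            raise ValueError(f"line {line_no}: not an http(s) URL: {url!r}")
        urls.append(url)
    return urls


manifest_text = (
    "https://manifests.britishart.yale.edu/manifest/40005\n"
    "https://manifests.britishart.yale.edu/manifest/40006\n"
)
print(len(parse_manifest_list(manifest_text)), "manifests listed")
```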

Demonstrations (Developed with PixPlot 2.0 codebase)

Link                  Image Count   Collection Info                   Browse Images                Download for PixPlot
NewsPlot: 1910-1912   24,026        George Grantham Bain Collection   News in the 1910s            Images, Metadata
Bildefelt i Oslo      31,097        oslobilder                        Advanced search, 1860-1924   Images, Metadata

Acknowledgements

The DHLab would like to thank Cyril Diagne and Nicolas Barradeau, lead developers of the spectacular Google Arts Experiments TSNE viewer, for generously sharing ideas on optimization techniques used in this viewer, and Lillianna Marie for naming this viewer PixPlot.
