A large-scale text-to-image prompt gallery dataset based on Stable Diffusion

DiffusionDB

DiffusionDB is the first large-scale text-to-image prompt dataset. It contains 14 million images generated by Stable Diffusion using prompts and hyperparameters specified by real users. The unprecedented scale and diversity of this human-actuated dataset provide exciting research opportunities in understanding the interplay between prompts and generative models, detecting deepfakes, and designing human-AI interaction tools to help users more easily use these models.

Get Started

DiffusionDB is available at 🤗 Hugging Face Datasets.

Two Subsets

DiffusionDB provides two subsets (DiffusionDB 2M and DiffusionDB Large) to support different needs.

| Subset | Num of Images | Num of Unique Prompts | Size | Image Directory | Metadata Table |
|---|---|---|---|---|---|
| DiffusionDB 2M | 2M | 1.5M | 1.6TB | images/ | metadata.parquet |
| DiffusionDB Large | 14M | 1.8M | 6.5TB | diffusiondb-large-part-1/ diffusiondb-large-part-2/ | metadata-large.parquet |

Key Differences

  1. The two subsets have a similar number of unique prompts, but DiffusionDB Large has many more images. DiffusionDB Large is a superset of DiffusionDB 2M.
  2. Images in DiffusionDB 2M are stored in png format; images in DiffusionDB Large use a lossless webp format.

Dataset Structure

We use a modularized file structure to distribute DiffusionDB. The 2 million images in DiffusionDB 2M are split into 2,000 folders, where each folder contains 1,000 images and a JSON file that links these 1,000 images to their prompts and hyperparameters. Similarly, the 14 million images in DiffusionDB Large are split into 14,000 folders.

# DiffusionDB 2M
./
├── images
│   ├── part-000001
│   │   ├── 3bfcd9cf-26ea-4303-bbe1-b095853f5360.png
│   │   ├── 5f47c66c-51d4-4f2c-a872-a68518f44adb.png
│   │   ├── 66b428b9-55dc-4907-b116-55aaa887de30.png
│   │   ├── [...]
│   │   └── part-000001.json
│   ├── part-000002
│   ├── part-000003
│   ├── [...]
│   └── part-002000
└── metadata.parquet
# DiffusionDB Large
./
├── diffusiondb-large-part-1
│   ├── part-000001
│   │   ├── 0a8dc864-1616-4961-ac18-3fcdf76d3b08.webp
│   │   ├── 0a25cacb-5d91-4f27-b18a-bd423762f811.webp
│   │   ├── 0a52d584-4211-43a0-99ef-f5640ee2fc8c.webp
│   │   ├── [...]
│   │   └── part-000001.json
│   ├── part-000002
│   ├── part-000003
│   ├── [...]
│   └── part-010000
├── diffusiondb-large-part-2
│   ├── part-010001
│   │   ├── 0a68f671-3776-424c-91b6-c09a0dd6fc2d.webp
│   │   ├── 0a0756e9-1249-4fe2-a21a-12c43656c7a3.webp
│   │   ├── 0aa48f3d-f2d9-40a8-a800-c2c651ebba06.webp
│   │   ├── [...]
│   │   └── part-010001.json
│   ├── part-010002
│   ├── part-010003
│   ├── [...]
│   └── part-014000
└── metadata-large.parquet
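
The layout above implies a simple mapping from a part number to its folder and JSON file. The sketch below shows one way to express it. The helper name `part_paths` is hypothetical, and the split of DiffusionDB Large at part 10,000 is read off the directory tree above rather than taken from an official API.

```python
from pathlib import PurePosixPath


def part_paths(part_id: int, large: bool = False) -> tuple[PurePosixPath, PurePosixPath]:
    """Return (folder, json_file) for a part, relative to the dataset root.

    Hypothetical helper: folder names follow the directory trees shown above.
    """
    max_id = 14000 if large else 2000
    if not 1 <= part_id <= max_id:
        raise ValueError(f"part_id must be in 1..{max_id}, got {part_id}")

    if not large:
        root = "images"
    elif part_id <= 10000:
        root = "diffusiondb-large-part-1"
    else:
        root = "diffusiondb-large-part-2"

    name = f"part-{part_id:06d}"  # zero-padded to six digits
    folder = PurePosixPath(root) / name
    return folder, folder / f"{name}.json"


folder, json_file = part_paths(10001, large=True)
print(folder)     # diffusiondb-large-part-2/part-010001
print(json_file)  # diffusiondb-large-part-2/part-010001/part-010001.json
```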

These sub-folders have names part-0xxxxx, and each image has a unique name generated by UUID Version 4. The JSON file in a sub-folder has the same name as the sub-folder. Each image is a PNG file (DiffusionDB 2M) or a lossless WebP file (DiffusionDB Large). The JSON file contains key-value pairs mapping image filenames to their prompts and hyperparameters. For example, below is the key-value pair for f3501e05-aef7-4225-a9e9-f516527408ac.png in part-000001.json.

{
  "f3501e05-aef7-4225-a9e9-f516527408ac.png": {
    "p": "geodesic landscape, john chamberlain, christopher balaskas, tadao ando, 4 k, ",
    "se": 38753269,
    "c": 12.0,
    "st": 50,
    "sa": "k_lms"
  }
}

The data fields are:

  • key: Unique image name
  • p: Prompt
  • se: Random seed
  • c: CFG Scale (guidance scale)
  • st: Steps
  • sa: Sampler
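
To make the field list concrete, here is a minimal sketch that parses an entry like the one above and renames the short keys. It uses an inline copy of the example entry rather than a downloaded file; the descriptive names in `FIELD_NAMES` and the helper `expand_entries` are illustrative and not part of DiffusionDB.

```python
import json

# Inline copy of the example entry above; a real part JSON file holds 1,000 such entries.
PART_JSON = """
{
  "f3501e05-aef7-4225-a9e9-f516527408ac.png": {
    "p": "geodesic landscape, john chamberlain, christopher balaskas, tadao ando, 4 k, ",
    "se": 38753269,
    "c": 12.0,
    "st": 50,
    "sa": "k_lms"
  }
}
"""

# Short JSON keys -> descriptive names (the descriptive names are illustrative).
FIELD_NAMES = {"p": "prompt", "se": "seed", "c": "cfg_scale", "st": "steps", "sa": "sampler"}


def expand_entries(raw: str) -> list[dict]:
    """Turn the filename -> fields mapping into a list of flat records."""
    records = []
    for image_name, fields in json.loads(raw).items():
        record = {"image_name": image_name}
        record.update({FIELD_NAMES[key]: value for key, value in fields.items()})
        records.append(record)
    return records


records = expand_entries(PART_JSON)
print(records[0]["seed"], records[0]["sampler"])  # 38753269 k_lms
```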

Dataset Metadata

To help you easily access prompts and other attributes of images without downloading all the zip files, we include two metadata tables, metadata.parquet and metadata-large.parquet, for DiffusionDB 2M and DiffusionDB Large, respectively.

The shape of metadata.parquet is (2000000, 13) and the shape of metadata-large.parquet is (14000000, 13). The two tables share the same schema, and each row represents an image. We store these tables in the Parquet format because Parquet is column-based: you can efficiently query individual columns (e.g., prompts) without reading the entire table.

Below are three random rows from metadata.parquet.

| image_name | prompt | part_id | seed | step | cfg | sampler | width | height | user_name | timestamp | image_nsfw | prompt_nsfw |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0c46f719-1679-4c64-9ba9-f181e0eae811.png | a small liquid sculpture, corvette, viscous, reflective, digital art | 1050 | 2026845913 | 50 | 7 | 8 | 512 | 512 | c2f288a2ba9df65c38386ffaaf7749106fed29311835b63d578405db9dbcafdb | 2022-08-11 09:05:00+00:00 | 0.0845108 | 0.00383462 |
| a00bdeaa-14eb-4f6c-a303-97732177eae9.png | human sculpture of lanky tall alien on a romantic date at italian restaurant with smiling woman, nice restaurant, photography, bokeh | 905 | 1183522603 | 50 | 10 | 8 | 512 | 768 | df778e253e6d32168eb22279a9776b3cde107cc82da05517dd6d114724918651 | 2022-08-19 17:55:00+00:00 | 0.692934 | 0.109437 |
| 6e5024ce-65ed-47f3-b296-edb2813e3c5b.png | portrait of barbaric spanish conquistador, symmetrical, by yoichi hatakenaka, studio ghibli and dan mumford | 286 | 1713292358 | 50 | 7 | 8 | 512 | 640 | 1c2e93cfb1430adbd956be9c690705fe295cbee7d9ac12de1953ce5e76d89906 | 2022-08-12 03:26:00+00:00 | 0.0773138 | 0.0249675 |

Metadata Schema

metadata.parquet and metadata-large.parquet share the same schema.

| Column | Type | Description |
|---|---|---|
| image_name | string | Image UUID filename. |
| prompt | string | The text prompt used to generate this image. |
| part_id | uint16 | Folder ID of this image. |
| seed | uint32 | Random seed used to generate this image. |
| step | uint16 | Step count (hyperparameter). |
| cfg | float32 | Guidance scale (hyperparameter). |
| sampler | uint8 | Sampler method (hyperparameter). Mapping: {1: "ddim", 2: "plms", 3: "k_euler", 4: "k_euler_ancestral", 5: "k_heun", 6: "k_dpm_2", 7: "k_dpm_2_ancestral", 8: "k_lms", 9: "others"}. |
| width | uint16 | Image width. |
| height | uint16 | Image height. |
| user_name | string | SHA256 hash of the unique Discord ID of the user who generated this image. For example, the hash for xiaohk#3146 is e285b7ef63be99e9107cecd79b280bde602f17e0ca8363cb7a0889b67f0b5ed0. "deleted_account" refers to users who have deleted their accounts. None means the image was deleted before we scraped it for the second time. |
| timestamp | timestamp | UTC timestamp when this image was generated. None means the image was deleted before we scraped it for the second time. Note that the timestamp is not accurate for duplicate images that have the same prompt, hyperparameters, width, and height. |
| image_nsfw | float32 | Likelihood of an image being NSFW. Scores are predicted by LAION's state-of-the-art NSFW detector (range from 0 to 1). A score of 2.0 means the image has already been flagged as NSFW and blurred by Stable Diffusion. |
| prompt_nsfw | float32 | Likelihood of a prompt being NSFW. Scores are predicted by the Detoxify library. Each score represents the maximum of toxicity and sexual_explicit (range from 0 to 1). |
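
Because the sampler column stores small integers, a lookup is needed to recover the sampler name. The following sketch copies the mapping from the schema above; the helper name `sampler_name` is hypothetical.

```python
# Sampler ID -> name, copied from the schema description of the `sampler` column.
SAMPLER_NAMES = {
    1: "ddim",
    2: "plms",
    3: "k_euler",
    4: "k_euler_ancestral",
    5: "k_heun",
    6: "k_dpm_2",
    7: "k_dpm_2_ancestral",
    8: "k_lms",
    9: "others",
}


def sampler_name(sampler_id: int) -> str:
    """Translate a `sampler` column value into its sampler name."""
    try:
        return SAMPLER_NAMES[sampler_id]
    except KeyError:
        raise ValueError(f"unknown sampler id: {sampler_id}") from None


print(sampler_name(8))  # prints "k_lms"
```

All three sample rows shown earlier use sampler 8, which this lookup resolves to k_lms.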

Warning: Although the Stable Diffusion model has an NSFW filter that automatically blurs user-generated NSFW images, this filter is not perfect, and DiffusionDB still contains some NSFW images. Therefore, we compute and provide NSFW scores for images and prompts using state-of-the-art models. The distribution of these scores is shown below. Please decide on an appropriate NSFW score threshold to filter out NSFW images before using DiffusionDB in your projects.

NSFW Score distributions.
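
One possible way to apply such a threshold is sketched below, using the scores from the three sample rows of metadata.parquet. The 0.5 defaults are placeholders chosen for illustration, not recommended values, and the helper name `is_safe` is hypothetical.

```python
# Score that marks an image already flagged as NSFW and blurred by Stable Diffusion.
FLAGGED_SCORE = 2.0


def is_safe(row: dict, image_threshold: float = 0.5, prompt_threshold: float = 0.5) -> bool:
    """Keep a row only if both NSFW scores fall below the chosen thresholds.

    The 0.5 defaults are placeholders, not recommended values.
    """
    if row["image_nsfw"] >= FLAGGED_SCORE:
        return False  # already flagged and blurred, regardless of threshold
    return row["image_nsfw"] < image_threshold and row["prompt_nsfw"] < prompt_threshold


# Scores taken from the three sample rows of metadata.parquet.
rows = [
    {"image_name": "0c46f719-1679-4c64-9ba9-f181e0eae811.png",
     "image_nsfw": 0.0845108, "prompt_nsfw": 0.00383462},
    {"image_name": "a00bdeaa-14eb-4f6c-a303-97732177eae9.png",
     "image_nsfw": 0.692934, "prompt_nsfw": 0.109437},
    {"image_name": "6e5024ce-65ed-47f3-b296-edb2813e3c5b.png",
     "image_nsfw": 0.0773138, "prompt_nsfw": 0.0249675},
]

kept = [row["image_name"] for row in rows if is_safe(row)]
print(len(kept))  # prints 2: the second row exceeds the image threshold
```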

Loading DiffusionDB

DiffusionDB is large (1.6TB or 6.5TB)! However, with our modularized file structure, you can easily load a desired number of images and their prompts and hyperparameters. In the example-loading.ipynb notebook, we demonstrate three methods to load a subset of DiffusionDB. Below is a short summary.

Method 1: Use Hugging Face Datasets Loader

You can use the Hugging Face Datasets library to easily load prompts and images from DiffusionDB. We pre-defined 16 DiffusionDB subsets (configurations) based on the number of instances. You can see all subsets in the Dataset Preview.

Note: To use the Datasets loader, you also need to install Pillow (pip install Pillow).

from datasets import load_dataset

# Load the dataset with the `large_random_1k` subset
dataset = load_dataset('poloclub/diffusiondb', 'large_random_1k')

Method 2: Use a downloader script

This repo includes a Python downloader download.py that allows you to download and load DiffusionDB. You can use it from your command line. Below is an example of loading a subset of DiffusionDB.

Usage/Examples

The script is run using command-line arguments as follows:

  • -i --index - File to download or lower bound of a range of files if -r is also set.
  • -r --range - Upper bound of range of files to download if -i is set.
  • -o --output - Name of custom output directory. Defaults to the current directory if not set.
  • -z --unzip - Unzip the file/files after downloading
  • -l --large - Download from Diffusion DB Large. Defaults to Diffusion DB 2M.

Downloading a single file

The specific file to download is supplied as the number at the end of the file on HuggingFace. The script will automatically pad the number out and generate the URL.

python download.py -i 23
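
The padding step can be pictured with the sketch below. It is not the code from download.py: the base URL is an assumption inferred from the metadata.parquet link in this README, the images/ path follows the DiffusionDB 2M layout, and the helper name `part_zip_url` is hypothetical.

```python
# Assumed base URL, inferred from the metadata.parquet link in this README.
BASE_URL = "https://huggingface.co/datasets/poloclub/diffusiondb/resolve/main"


def part_zip_url(index: int) -> str:
    """Build the assumed URL of one DiffusionDB 2M part file from its number."""
    if not 1 <= index <= 2000:
        raise ValueError(f"index must be in 1..2000, got {index}")
    # Pad the number to six digits, e.g. 23 -> part-000023.zip
    return f"{BASE_URL}/images/part-{index:06d}.zip"


print(part_zip_url(23))
```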

Downloading a range of files

The lower and upper bounds of the set of files to download are set by the -i and -r flags, respectively.

python download.py -i 1 -r 2000

Note that this range will download the entire dataset. The script will ask you to confirm that you have 1.7TB free at the download destination.

Downloading to a specific directory

The script will default to the location of the dataset's part .zip files at images/. If you wish to move the download location, you should move these files as well or use a symbolic link.

python download.py -i 1 -r 2000 -o /home/$USER/datahoarding/etc

The script will automatically add the / between the directory and the file when it downloads.

Setting the files to unzip once they've been downloaded

The script unzips the files only after all of them have been downloaded, because both downloading and unzipping can be lengthy processes.

python download.py -i 1 -r 2000 -z

Method 3: Use metadata.parquet (Text Only)

If your task does not require images, then you can easily access all 2 million prompts and hyperparameters in the metadata.parquet table.

from urllib.request import urlretrieve
import pandas as pd

# Download the parquet table
table_url = 'https://huggingface.co/datasets/poloclub/diffusiondb/resolve/main/metadata.parquet'
urlretrieve(table_url, 'metadata.parquet')

# Read the table using Pandas
metadata_df = pd.read_parquet('metadata.parquet')

Dataset Creation

We collected all images from the official Stable Diffusion Discord server. Please read our research paper for details. The code is included in ./scripts/.

Data Removal

If you find any harmful images or prompts in DiffusionDB, you can use this Google Form to report them. Similarly, if you are a creator of an image included in this dataset, you can use the same form to let us know if you would like to remove your image from DiffusionDB. We will closely monitor this form and update DiffusionDB periodically.

Credits

DiffusionDB was created by Jay Wang, Evan Montoya, David Munechika, Alex Yang, Ben Hoover, and Polo Chau.

Citation

@article{wangDiffusionDBLargescalePrompt2022,
  title = {{{DiffusionDB}}: {{A}} Large-Scale Prompt Gallery Dataset for Text-to-Image Generative Models},
  author = {Wang, Zijie J. and Montoya, Evan and Munechika, David and Yang, Haoyang and Hoover, Benjamin and Chau, Duen Horng},
  year = {2022},
  journal = {arXiv:2210.14896 [cs]},
  url = {https://arxiv.org/abs/2210.14896}
}

Licensing

The DiffusionDB dataset is available under the CC0 1.0 License. The Python code in this repository is available under the MIT License.

Contact

If you have any questions, feel free to open an issue or contact Jay Wang.
