  • Stars: 190
  • Rank: 197,591 (Top 4%)
  • Language: Python
  • Created almost 3 years ago
  • Updated 12 months ago

Repository Details

Get hundreds of millions of image+url pairs from the crawling at home dataset and preprocess them

laion-prepro

Get billions of image+url pairs from the laion datasets and preprocess them.

This repository can be run on:

  • for laion400m: one machine with 32GB of RAM, 8TB of disk, 16 i7 cores and a 1Gbps connection
  • for laion5B: 10 machines similar to the laion400m one

What is laion?

The laion project aims to use commoncrawl to retrieve billions of aligned image+text pairs. It is composed of a central server that tracks the progress of decentralized workers (run by anyone) that process small chunks of commoncrawl. So far, 5B such pairs have been retrieved. Read more about it in the laion 400M release post.
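
To make the server/worker split concrete, here is a minimal sketch of a central tracker handing chunks to workers and recording their results. It is a toy in-memory illustration, not the project's actual server: `ChunkTracker`, `fake_worker`, and the chunk ids are all hypothetical names, and the real workers would parse commoncrawl data where the comment indicates.

```python
from dataclasses import dataclass, field


@dataclass
class ChunkTracker:
    """Toy in-memory stand-in for the central server (hypothetical API)."""

    pending: list                                    # chunk ids not yet handed out
    in_progress: dict = field(default_factory=dict)  # chunk id -> worker id
    done: dict = field(default_factory=dict)         # chunk id -> pairs found

    def claim(self, worker_id):
        """Hand the next unprocessed chunk to a worker, or None if none are left."""
        if not self.pending:
            return None
        chunk_id = self.pending.pop(0)
        self.in_progress[chunk_id] = worker_id
        return chunk_id

    def complete(self, chunk_id, pair_count):
        """Record that a chunk was processed and how many pairs it yielded."""
        del self.in_progress[chunk_id]
        self.done[chunk_id] = pair_count

    def total_pairs(self):
        """Total number of pairs reported across all finished chunks."""
        return sum(self.done.values())


def fake_worker(tracker, worker_id, pairs_per_chunk=3):
    """Hypothetical worker loop: claim chunks until the tracker has none left."""
    processed = []
    while (chunk_id := tracker.claim(worker_id)) is not None:
        # A real worker would parse the chunk and extract image+text pairs here.
        tracker.complete(chunk_id, pairs_per_chunk)
        processed.append(chunk_id)
    return processed
```

Any number of workers can call `claim` and `complete` independently; the tracker only needs to know which chunks are pending, in progress, or done.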

What can be done with these datasets?

Vision and language modeling has been taking off in 2021. Here are some pointers about what this kind of image+text dataset unlocks and why it seems really interesting:

  • 6 months ago, OpenAI released two blog posts and papers: CLIP and DALL-E. Both models rely on a large number of (text, image) pairs. They used an unreleased dataset of 400M pairs.
    • CLIP is a model that computes how related a text and an image are. This makes it possible to build large-scale text-to-image search, and to create that kind of crazy text-to-image art (clip-art). They released a small and a medium version of the model but no training code.
    • DALL-E is a model that directly generates images from text. As can be seen from the blog post, it achieves very impressive results that could have a direct impact on the world, for anything that needs drawings and illustrations. OpenAI did not release any model, even through an API.
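
The text-to-image search idea can be sketched as ranking images by the similarity between a text embedding and image embeddings. The snippet below is only an illustration under stated assumptions: the 3-dimensional vectors are made up by hand, whereas a real system would obtain embeddings from a CLIP text encoder and image encoder, and cosine similarity is used here as the relatedness score.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(text_embedding, image_embeddings, top_k=2):
    """Return the names of the top_k images most similar to the text embedding."""
    scored = [
        (cosine_similarity(text_embedding, embedding), name)
        for name, embedding in image_embeddings.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:top_k]]


# Hypothetical, hand-written embeddings (a real model outputs hundreds of dimensions).
IMAGE_EMBEDDINGS = {
    "cat.jpg": [0.9, 0.1, 0.0],
    "dog.jpg": [0.7, 0.6, 0.1],
    "car.jpg": [0.0, 0.2, 0.9],
}
TEXT_EMBEDDING = [1.0, 0.0, 0.0]  # pretend this encodes "a photo of a cat"

print(search(TEXT_EMBEDDING, IMAGE_EMBEDDINGS))
```

At the scale of hundreds of millions of images, the exhaustive loop in `search` would typically be replaced by an approximate nearest-neighbor index.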

Since then, several efforts have been organized to replicate DALL-E. People initially organized around this awesome DALL-E replication repository, DALLE-pytorch, with some nice results that can be seen in its readme. More recently, as part of a huggingface event, new results have been achieved (see the dalle mini report) and an online demo is now available (dalle-mini demo).

The replication effort is still far from achieving the same performance as the original DALL-E, and it seems possible to go even further. Some people also want to make a better CLIP to produce even better generated art.

A large part of the results that can be achieved with such models is thanks to data: large amounts of data. Before laion 400M, the largest open datasets of (image, text) pairs were on the order of 10M pairs (see DALLE-datasets), which is enough to train okay models but not enough to reach the best performance. Having a public dataset with hundreds of millions of pairs will help a lot to build these image+text models.

Visualization of the dataset

Check the colab and the web demo

laion5B

Processing for laion5B and laion400m is largely the same, but since laion5B is 10x larger, everything had to be made distributed.

Read more at laion5B/README.md
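
As a rough illustration of what "making everything distributed" can mean, the sketch below splits a list of input shards across several machines round-robin so each machine processes its own subset. This is an assumption for illustration only, not the scheme used by the laion5B pipeline; `assign_shards` and the shard numbering are hypothetical.

```python
def assign_shards(shard_ids, machine_count):
    """Distribute shard ids across machines round-robin (hypothetical scheme)."""
    assignments = {machine: [] for machine in range(machine_count)}
    for index, shard_id in enumerate(shard_ids):
        assignments[index % machine_count].append(shard_id)
    return assignments


# Example: 25 shards spread over 10 machines.
plan = assign_shards(list(range(25)), machine_count=10)
for machine, shards in plan.items():
    print(f"machine {machine}: shards {shards}")
```

Round-robin keeps the per-machine shard counts within one of each other, which is a reasonable default when shards are of similar size.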

laion400m

See laion400m/README.md
