  • Stars: 303
  • Rank 137,655 (Top 3 %)
  • Language: Python
  • License: MIT License
  • Created about 2 years ago
  • Updated about 1 year ago

Repository Details

Easily convert common crawl to a dataset of caption and document. Image/text Audio/text Video/text, ...

cc2dataset

Common Crawl has 5M wat files, which provide the links found on the web. This simple tool lets you process one warc in about 50s and get document links along with their alt text.

It also runs deduplication against url+text in order to save on output space and speed up the process.
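The idea of deduplicating on url+text can be pictured with a minimal single-machine sketch. The tool itself runs on Spark; the helper name `dedup_pairs` and the hashing scheme below are illustrative assumptions, not the package's implementation.

```python
import hashlib
from typing import Iterable, Iterator, Tuple


def dedup_pairs(pairs: Iterable[Tuple[str, str]]) -> Iterator[Tuple[str, str]]:
    """Yield each (url, text) pair once, keyed on a hash of url+text.

    Hypothetical helper for illustration only.
    """
    seen = set()
    for url, text in pairs:
        # Hash the pair so the set stores fixed-size keys instead of full
        # strings. The separator keeps ("ab", "c") distinct from ("a", "bc").
        key = hashlib.sha256((url + "\x00" + text).encode("utf-8")).digest()
        if key not in seen:
            seen.add(key)
            yield url, text


sample = [
    ("http://example.com/a.jpg", "a cat"),
    ("http://example.com/a.jpg", "a cat"),     # exact duplicate, dropped
    ("http://example.com/a.jpg", "a kitten"),  # same url, new text, kept
]
unique = list(dedup_pairs(sample))
```

Keying on the pair rather than on the url alone keeps distinct captions for the same link, which matches the "url+text" wording above.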

This makes it possible to do the first step of building a dataset like laion5B in about 70k CPU core hours (5*10^6 * 50 / 3600 ≈ 69.4k). That is about $2.8k using AWS EC2 at $0.04 per core hour.
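The estimate can be reproduced with a few lines of arithmetic. The constants are the round figures quoted in this section; the variable names are only for illustration.

```python
WAT_FILE_COUNT = 5 * 10**6    # about 5M wat files in Common Crawl
SECONDS_PER_FILE = 50         # about 50s to process one file on one core
PRICE_PER_CORE_HOUR = 0.04    # dollars per core hour, the EC2 price quoted above

# Total single-core time, converted from seconds to hours.
core_hours = WAT_FILE_COUNT * SECONDS_PER_FILE / 3600
cost_dollars = core_hours * PRICE_PER_CORE_HOUR

print(f"{core_hours:,.0f} core hours")  # 69,444 core hours
print(f"${cost_dollars:,.0f}")          # $2,778
```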

Intended usage

This tool produces a collection of link + caption pairs. It is meant as stage 1 of creating a dataset. It does deduplication and as little filtering as possible (does the link look like a URL, and is the caption non-empty).
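A minimal sketch of what such a light filter could look like, using only the standard library. The function names and the exact checks (http/https scheme, non-blank caption) are assumptions made for illustration; the checks in the package may differ.

```python
from urllib.parse import urlparse


def looks_like_url(link: str) -> bool:
    """Return True when the link has an http(s) scheme and a host (assumed check)."""
    try:
        parsed = urlparse(link)
    except ValueError:
        # urlparse can raise on malformed input such as a broken IPv6 host.
        return False
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def keep_pair(link: str, caption: str) -> bool:
    """Keep a (link, caption) pair only if both pass the minimal checks."""
    return looks_like_url(link) and bool(caption and caption.strip())
```

For example, `keep_pair("https://example.com/a.jpg", "a cat")` is kept, while a pair with a whitespace-only caption or a link without a scheme is dropped.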

This produces a large quantity of raw data that can then be further filtered by appropriate techniques. An example of stage 2 can be to estimate the similarity between (link, text) with a model such as CLIP. This may reduce the quantity of data by a factor of up to 100x depending on the chosen threshold.
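As a rough sketch of such a stage 2, the snippet below filters pairs by the cosine similarity between two embeddings. It assumes the embeddings were already computed by a model such as CLIP; the row layout, the function names and the 0.3 threshold are hypothetical and chosen only for the example.

```python
import math
from typing import List, Sequence, Tuple

# (link, text, embedding of the linked document, embedding of the text)
Row = Tuple[str, str, Sequence[float], Sequence[float]]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors, 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filter_by_similarity(rows: List[Row], threshold: float) -> List[Tuple[str, str]]:
    """Keep the (link, text) pairs whose embeddings are at least `threshold` similar."""
    return [
        (link, text)
        for link, text, link_emb, text_emb in rows
        if cosine_similarity(link_emb, text_emb) >= threshold
    ]


rows = [
    ("http://example.com/cat.jpg", "a cat", [1.0, 0.0], [0.9, 0.1]),    # well aligned
    ("http://example.com/dog.jpg", "buy now", [0.0, 1.0], [1.0, 0.0]),  # unrelated
]
kept = filter_by_similarity(rows, threshold=0.3)
```

Raising the threshold keeps fewer, better-aligned pairs, which is where the reduction factor mentioned above comes from.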

What hardware to pick?

CC is big and located in S3 us-east-1, so in terms of network it makes a lot of sense to use machines located in the same region.

cpu128-dy-c6i-32xlarge instances are advised. Spark stores the deduplicated first-stage output on local disk, which should be an NVMe drive for speed during deduplication.

At this first stage, one wat takes about 20MB, so the total space (over all workers) must be more than 20MB times the wat count. For the whole CC, that means 100TB, which fits for example in 150 instances with a 1TB NVMe drive each. 150 instances of 128 cores is 19200 cores, so at about 50s per file (about 70k core hours in total) the whole processing takes roughly 3.6h.

Fewer instances with bigger drives can work too. If temporary disk space is an issue, it is also possible to do the processing in multiple pieces by specifying --multipart.
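The disk and core sizing above can be checked with a short calculation. The constants are the figures from this section (using 1TB = 10^6 MB); the variable names are illustrative.

```python
import math

WAT_FILE_COUNT = 5 * 10**6   # wat files in the whole CC
MB_PER_WAT = 20              # first-stage output per wat file
NVME_TB_PER_INSTANCE = 1     # local NVMe capacity per instance
CORES_PER_INSTANCE = 128
INSTANCES = 150

# Temporary space needed across all workers, in TB.
total_tb = WAT_FILE_COUNT * MB_PER_WAT / 10**6

# Smallest instance count whose combined NVMe space holds the first stage.
min_instances = math.ceil(total_tb / NVME_TB_PER_INSTANCE)

total_cores = INSTANCES * CORES_PER_INSTANCE

print(f"{total_tb:.0f} TB needed, at least {min_instances} instances")
print(f"{total_cores} cores with {INSTANCES} instances")
```

With 150 instances there is about 50% headroom over the 100TB minimum, which leaves room for uneven partition sizes.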

Document type

This tool supports extracting several document types from CC:

  • image/text: about 300B after dedup
  • audio/text: about 2B after dedup
  • text doc: about 10B after dedup
  • video/text: about 2B after dedup

They can be selected with e.g. --document_type audio. You may experiment with more document kinds by running python example single_warc_example.py and exploring the resulting output.parquet.

Install

pip install cc2dataset

Python examples

Check out these examples:

If you have a slurm cluster, refer to https://gist.github.com/rom1504/67ada3dedbecc113ae2dbdfd9c642d83 to start a spark cluster there.

API

This module exposes a single function cc2dataset which takes the same arguments as the command line tool:

  • output_path the output path, should probably start with s3://. The output will be written to this path suffixed by the date (required)
  • wat_index_count the number of wat index files to read, can be None for all. (default 1)
  • wat_count the number of wat files to read, can be None for all, will randomly subsample if present. (default 100)
  • master the spark master url. (default local)
  • num_cores the number of cores of each spark executor. (default 128)
  • mem_gb the memory of each spark executor. (default 256)
  • multipart runs the processing in the specified number of parts, merging at the end (default None)
  • shuffle randomly shuffle the output right before saving (default True)
  • resume the specific path of the output to resume (default None)
  • spark_builder a function that creates a spark session, None will default to the built-in methods (default None)
  • document_type the kind of document to extract (default image)
  • source_cc_protocol get common crawl from http or s3 (default s3)
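A call using the arguments listed above might look like the sketch below. The import path `from cc2dataset import cc2dataset` is an assumption based on the statement that the module exposes a function of that name, and the helper names `build_job_kwargs` and `run_job` are hypothetical. The values are the defaults from the list.

```python
def build_job_kwargs(output_path: str) -> dict:
    """Collect the documented arguments with their documented defaults."""
    return {
        "output_path": output_path,   # required, should probably start with s3://
        "wat_index_count": 1,
        "wat_count": 100,             # random subsample of 100 wat files
        "master": "local",
        "num_cores": 128,
        "mem_gb": 256,
        "multipart": None,
        "shuffle": True,
        "resume": None,
        "spark_builder": None,        # None falls back to the built-in methods
        "document_type": "image",
        "source_cc_protocol": "s3",
    }


def run_job(output_path: str) -> None:
    """Run the extraction (needs the package installed and a Spark setup)."""
    # Assumed import path; imported lazily so this file loads without the package.
    from cc2dataset import cc2dataset

    cc2dataset(**build_job_kwargs(output_path))
```

Calling `run_job` with an output location of your own, for example a hypothetical `s3://my-bucket/cc2dataset-output`, would start a run with the default settings. For a quick local try, lowering num_cores and mem_gb to match the machine is probably needed.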

For development

Either locally, or in gitpod (do export PIP_USER=false there)

Set up a virtualenv:

python3 -m venv .env
source .env/bin/activate
pip install -e .

To run tests:

pip install -r requirements-test.txt

then

make lint
make test

You can use make black to reformat the code.

Use python -m pytest -x -s -v tests -k "dummy" to run a specific test.

Thanks

  • Vaishaal for providing the initial CC parsing code with efficient libraries
  • rvencu for optimizing the cc parsing code for laion5B, on which the idea of this package is based
