DataComp
Welcome to our competition. This repository contains the participant tooling necessary to download data from our pool, train CLIP models, evaluate them on downstream tasks and submit to our leaderboard.
Overview
DataComp is a competition about designing datasets for pre-training CLIP models. Instead of iterating on model design and hyperparameter tuning as in traditional benchmarks, in DataComp your task is to curate a multimodal pre-training dataset of image-text pairs that yields high accuracy on downstream tasks. Model architecture and hyperparameters are fixed, allowing participants to innovate on dataset design. As part of the benchmark, we provide a large collection of uncurated image-text pairs, crawled from the public internet.
Our benchmark offers two tracks: one where participants must use only samples from the pools we provide (filtering), and another where participants can use external data, including samples from our pool (Bring your own data, BYOD).
DataComp is structured to accommodate participants with diverse levels of computational resources: each track is broken down into four scales, with varying compute requirements.
An overview of our benchmark and participant workflow can be found below. For more information, check out our paper and website.
Installing dependencies
Run:
bash create_env.sh
To activate the environment:
conda activate datacomp
If using cloud storage services (e.g. AWS S3), you'll need to install additional dependencies (e.g. pip install 'cloudpathlib[s3]').
Downloading CommonPool
To download, run the following command, replacing $scale with the competition scale (i.e. small, medium, large or xlarge) and $data_dir with the output directory where you want the data to be stored.
python download_upstream.py --scale $scale --data_dir $data_dir
There are four scales in our competition:
- small: 12.8M pool size, 12.8M examples seen
- medium: 128M pool size, 128M examples seen
- large: 1.28B pool size, 1.28B examples seen
- xlarge: 12.8B pool size, 12.8B examples seen
The script will create two directories inside $data_dir: metadata and shards.
Along with the images and captions, this script will also download metadata, including .parquet files that contain the image URLs, captions, and other potentially useful information such as the similarities between the images and captions given by trained OpenAI CLIP models.
If the flag --download_npz is used, the script will also download the .npz files with features extracted by the trained OpenAI CLIP models for each sample.
We download the image data using img2dataset, which stores it as .tar shards with the images and captions to be consumed by webdataset.
Once the download finishes, the data will be available at $data_dir/shards.
To download only metadata, use the --skip_shards flag.
The disk requirements for each scale are shown below.
| | metadata (parquets) | metadata (npzs) | data (tars) |
|---|---|---|---|
| small scale | 3 GB | 75 GB | 450 GB |
| medium scale | 30 GB | 750 GB | 4.5 TB |
| large scale | 300 GB | 7.5 TB | 45 TB |
| xlarge scale | 3 TB | 75 TB | 450 TB |
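Before starting a download, it can help to add up the rows of the table for the flags you plan to use. The following is a minimal sketch, not part of the repository: the helper name required_gb is hypothetical, 1 TB is taken as 1000 GB, and the sketch assumes the parquet metadata is always downloaded while --download_npz and --skip_shards toggle the other two columns.

```python
# Approximate disk requirements in GB, copied from the table above
# (assumption: 1 TB is counted as 1000 GB).
DISK_GB = {
    "small": {"parquets": 3, "npzs": 75, "tars": 450},
    "medium": {"parquets": 30, "npzs": 750, "tars": 4_500},
    "large": {"parquets": 300, "npzs": 7_500, "tars": 45_000},
    "xlarge": {"parquets": 3_000, "npzs": 75_000, "tars": 450_000},
}


def required_gb(scale, download_npz=False, skip_shards=False):
    """Return the approximate space needed for one scale, in GB.

    The arguments mirror the --download_npz and --skip_shards flags;
    this helper itself is hypothetical and not part of the repository.
    """
    sizes = DISK_GB[scale]
    total = sizes["parquets"]  # assumed to be downloaded in every case
    if download_npz:
        total += sizes["npzs"]
    if not skip_shards:
        total += sizes["tars"]
    return total
```

For example, required_gb("medium", download_npz=True, skip_shards=True) adds only the two metadata columns of the medium row.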
Downloading DataComp-1B
The script download_upstream.py can be used to download the DataComp-1B dataset that we release as our best performing subset of the xlarge pool. To download this, use the following command:
python download_upstream.py --scale datacomp_1b --data_dir $data_dir
The above command will create the same directory structure under $data_dir and can be modified as described above.
Downloading external data
The script download_upstream.py can also be used to download other image-text datasets, using img2dataset.
Given parquet files containing the image URLs and captions, you can use this script to download the images, by using the flag --metadata_dir to point to the directory where the parquet files are stored.
By default, we also download the parquet files corresponding to the pools we provide, and this metadata is stored in a subfolder of $data_dir.
Optimizing the download
When using img2dataset, there are several ways to optimize the download process such as using multiple nodes in a distributed environment or setting up a DNS resolver to increase the success rate of images being downloaded. See the img2dataset repository for further instructions on how to optimize the download process, as well as information on potential issues during the download.
Selecting samples in the filtering track
Before training, you will need to select the subset of samples you wish to use. Given a set of chosen samples, we create new shards with only those samples, which the training code then consumes. For each scale, models are trained for a fixed number of steps, regardless of the size of the chosen subset of the provided pool.
Each sample in our pool has a unique identifier, which is present in the metadata parquets, and in the json files inside the .tar shards.
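As an illustration of where the identifiers live inside a shard, the sketch below reads the json members of a tar archive with the Python standard library. It is not the repository's code: the helper name read_uids is hypothetical, and the key name "uid" and the member file name are assumptions. A tiny in-memory shard is built so the example runs without downloaded data.

```python
import io
import json
import tarfile


def read_uids(shard):
    """Return the uids stored in the .json members of an open tar shard."""
    uids = []
    for member in shard:
        if member.isfile() and member.name.endswith(".json"):
            record = json.load(shard.extractfile(member))
            uids.append(record["uid"])  # "uid" is an assumed key name
    return uids


# Build a tiny in-memory shard with one json member (hypothetical file name).
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w") as shard:
    payload = json.dumps({"uid": "139e4a9b22a614771f06c700a8ebe150"}).encode()
    info = tarfile.TarInfo(name="000000000.json")
    info.size = len(payload)
    shard.addfile(info, io.BytesIO(payload))

# Read the uids back from the in-memory shard.
buffer.seek(0)
with tarfile.open(fileobj=buffer, mode="r") as shard:
    demo_uids = read_uids(shard)
```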
The subset of samples should be described by a numpy array of dtype numpy.dtype("u8,u8") (i.e. a structured array of pairs of unsigned 64-bit integers), with shape (subset_size,), containing a list of uids (128-bit hashes from the parquet files) in lexicographically sorted order, saved to disk in either npy format or memory-mapped format.
For instance, if you have a list of uids uids = ['139e4a9b22a614771f06c700a8ebe150', '6e356964a967af455c8016b75d691203'], you can store them by running the following python code:
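import numpy as np
# Split each 128-bit hex uid into two unsigned 64-bit halves.
processed_uids = np.array([(int(uid[:16], 16), int(uid[16:32], 16)) for uid in uids], np.dtype("u8,u8"))
processed_uids.sort()  # the subset file must be in sorted order
np.save(out_filename, processed_uids)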
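To sanity-check a subset file, it can be useful to convert the structured array back to hex strings. The following is a minimal sketch under stated assumptions: the helper names uids_to_array and array_to_uids are hypothetical, and the field names f0 and f1 are the defaults numpy assigns to the "u8,u8" dtype.

```python
import numpy as np

UID_DTYPE = np.dtype("u8,u8")  # default field names are "f0" and "f1"


def uids_to_array(uids):
    """Convert 32-character hex uids into a sorted structured array."""
    pairs = [(int(uid[:16], 16), int(uid[16:32], 16)) for uid in uids]
    array = np.array(pairs, dtype=UID_DTYPE)
    array.sort()  # compares the high half first, then the low half
    return array


def array_to_uids(array):
    """Convert a structured uid array back to 32-character hex strings."""
    highs = array["f0"].tolist()
    lows = array["f1"].tolist()
    return [f"{high:016x}{low:016x}" for high, low in zip(highs, lows)]
```

Because each half is zero-padded to 16 hex digits, sorting the pairs numerically gives the same order as sorting the lowercase hex strings.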
After creating a subset, you may invoke the resharder to build the subset shards in $output_dir like so:
python resharder.py -i $download_dir -o $output_dir -s $subset_file
If desired, the resharder can be run in parallel on multiple nodes. The easiest way to do so is to split the input directory into smaller subfolders with fewer shards, and run a separate resharder job for each of them, each writing to a separate output directory.
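One way to plan such a split is sketched below. This is only an illustration: the helper names and the part_N subfolder naming are hypothetical, and moving or linking the shards into the subfolders is left out. Only the -i, -o and -s flags shown above are used.

```python
def split_round_robin(shard_names, num_jobs):
    """Assign shard file names to num_jobs groups of near-equal size."""
    groups = [[] for _ in range(num_jobs)]
    for index, name in enumerate(sorted(shard_names)):
        groups[index % num_jobs].append(name)
    return groups


def resharder_commands(input_root, output_root, subset_file, num_jobs):
    """Build one resharder command line per (hypothetical) part_N subfolder."""
    return [
        f"python resharder.py -i {input_root}/part_{job} "
        f"-o {output_root}/part_{job} -s {subset_file}"
        for job in range(num_jobs)
    ]
```

Each group returned by split_round_robin would be placed in its own input subfolder before launching the matching command on a separate node.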
Baselines
Here we provide command lines for the main filter baselines found in Table 3 of our paper, along with short descriptions. Each baseline reads the .parquet metadata files (and also the .npz files when needed), selects a subset of uids, sorts them, and saves them to a .npy subset file. This file can then be input to the resharder described above to create a webdataset containing only the selected subset of the pool.
Note: the --num_workers flag controls the number of metadata files that are read into memory and processed in parallel. It is set by default to the number of cores, but that may be too many for machines with many cores and limited memory. For baselines other than image-based filtering, allow at least 256MB of memory per worker.
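The memory guideline in the note can be turned into a rough worker count. The sketch below is a hypothetical helper, not part of baselines.py; it simply caps the number of cores by the 256MB-per-worker lower bound, which the note states for baselines other than image-based filtering.

```python
import os

MB_PER_WORKER = 256  # lower bound per worker stated in the note above


def suggest_num_workers(available_mb, num_cores=None):
    """Pick a --num_workers value that fits in the available memory."""
    if num_cores is None:
        num_cores = os.cpu_count() or 1
    by_memory = available_mb // MB_PER_WORKER
    # Never go below one worker, and never above the number of cores.
    return max(1, min(num_cores, by_memory))
```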
No filtering
Here we load all metadata uids without any additional filtering.
python baselines.py --metadata_dir path/to/metadata --save_path path/to/no_filter.npy --name no_filter
Basic filtering
Simple checks on caption length, English being the detected caption language, image size, and image aspect ratio.
python baselines.py --metadata_dir path/to/metadata --save_path path/to/basic_filter.npy --name basic_filter
CLIP score filtering
Retain the top k=0.3 fraction of the pool by L/14 CLIP score.
python baselines.py --metadata_dir path/to/metadata --save_path path/to/clip_score_l14_30_percent.npy --name clip_score --arch l14 --fraction 0.3
Retain all examples with B/32 CLIP score above 0.25.
python baselines.py --metadata_dir path/to/metadata --save_path path/to/clip_score_b32_25_threshold.npy --name clip_score --arch b32 --threshold 0.25
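The two selection modes above differ in what stays fixed: --fraction keeps a fixed share of the pool, while --threshold keeps however many samples exceed a fixed score. The following pure-Python sketch illustrates the difference on a dictionary mapping uid to score. It is not the implementation in baselines.py, and the function names are hypothetical.

```python
def top_fraction(scores, fraction):
    """Keep the uids with the highest scores, up to `fraction` of the pool."""
    keep = int(len(scores) * fraction)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return sorted(ranked[:keep])  # return the kept uids in sorted order


def above_threshold(scores, threshold):
    """Keep every uid whose score is strictly above `threshold`."""
    return sorted(uid for uid, score in scores.items() if score > threshold)
```

With the fraction mode the subset size is known in advance, whereas with the threshold mode it depends on the score distribution of the pool.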
LAION-2B filtering
Reproduces the filtering strategy used to create the LAION-2B dataset: applies a B/32 CLIP score filter on image-text pairs, retaining samples with score above 0.28, and an English filter using the gcld3 model to detect language.
python baselines.py --metadata_dir path/to/metadata --save_path path/to/laion.npy --name laion2b
Text-based filtering
A text filter that retains samples whose captions contain words from the ImageNet-21k synsets.
python baselines.py --metadata_dir path/to/metadata --save_path path/to/text_based.npy --name text_based
Image-based filtering
An image-clustering-based method that retains samples whose images have content close to ImageNet-1k training images, as measured by the nearest-neighbor cluster center of the image's L/14 CLIP embedding.
Note: this baseline uses GPU resources. By default it will try to use all GPUs. To control which GPUs are used, set the CUDA_VISIBLE_DEVICES environment variable.
python baselines.py --metadata_dir path/to/metadata --save_path path/to/image_based.npy --name image_based --image_based_scale small --batch_size 512
Note: this baseline requires pre-computed image cluster centroids, which will be downloaded automatically the first time you run it.
If you want to generate the centroids yourself, please see baselines/image_based_clustering.md for instructions.
Intersection of image-based and CLIP score filtering
Applies both the CLIP score (L/14, top 0.3 fraction) filter and the image-based filter. This is our best performing baseline for the medium, large, and xlarge scales. We used this strategy at the xlarge scale to create the DataComp-1B dataset.
Note: this baseline uses GPU resources. By default it will try to use all GPUs. To control which GPUs are used, set the CUDA_VISIBLE_DEVICES environment variable.
python baselines.py --metadata_dir path/to/metadata --save_path path/to/image_based_intersect_clip_score_l14_30_percent.npy --name image_based_intersect_clip_score --image_based_scale small --batch_size 512 --arch l14 --fraction 0.3
Training
To train, run the following command:
torchrun --nproc_per_node $num_gpus train.py --scale $scale --data_dir $data_dir --output_dir $output_dir --exp_name $exp_name
We support using multiple data directories. For instance, if your data is in /path/to/dir/1 and in /path/to/dir/2, you can use the flag --data_dir=/path/to/dir/1::/path/to/dir/2.
A sample script for training with SLURM is provided at tools/slurm_train.sh.
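When the list of directories is produced by a script, the :: convention can be handled with two small helpers. This is a hypothetical sketch of the convention only and is not taken from train.py.

```python
SEPARATOR = "::"


def join_data_dirs(directories):
    """Build the value passed to --data_dir from a list of directories."""
    return SEPARATOR.join(directories)


def split_data_dirs(value):
    """Recover the individual directories from a ::-separated value."""
    return [part for part in value.split(SEPARATOR) if part]
```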
Hyper-parameters
The hyper-parameters used for training are fixed for a given scale. For the small and medium scales we use ViT-B/32 models, for the large scale, ViT-B/16, and for the xlarge scale, ViT-L/14. The number of samples seen during training is determined by the scale, and is equal to the size of the corresponding pool we provide. Additional details on hyper-parameters can be found in the paper.
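Because the number of samples seen is fixed per scale, a smaller subset is simply repeated more often during training. The sketch below records the per-scale settings stated in this README in a dictionary and computes the average number of passes over a subset. The dictionary layout and the helper name are hypothetical and are not read from the training code.

```python
# Per-scale settings as stated in this README (layout is an assumption).
SCALES = {
    "small": {"model": "ViT-B/32", "samples_seen": 12_800_000},
    "medium": {"model": "ViT-B/32", "samples_seen": 128_000_000},
    "large": {"model": "ViT-B/16", "samples_seen": 1_280_000_000},
    "xlarge": {"model": "ViT-L/14", "samples_seen": 12_800_000_000},
}


def passes_over_subset(scale, subset_size):
    """Average number of times each selected sample is seen in training."""
    return SCALES[scale]["samples_seen"] / subset_size
```

For instance, a subset containing 30% of the medium pool (38.4M samples) would be seen a little over three times on average.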
You should not modify any hyper-parameters for training, including batch size. Any changes may affect accuracy and make results incomparable.
Note on variance across runs
We observed small (but non-zero) variance in performance when changing random seeds, with differences in accuracy typically in the range of 0.2 percentage points on ImageNet and up to 0.006 on average. We also note that some factors can make runs non-deterministic even when setting the same random seed (for example, random network failures when streaming data can cause different batches to be formed when re-running; see also https://pytorch.org/docs/stable/notes/randomness.html).
Evaluation
[Optional] Pre-download evaluation datasets
Pre-downloading evaluation datasets is optional if you have a strong Internet connection; by default, the data will be streamed directly from Hugging Face Hub. If you wish to download the data, run the following command, replacing $download_dir with your desired download path:
python download_evalsets.py $download_dir
Evaluating
To evaluate, run the following command:
python evaluate.py --train_output_dir $train_output_dir/$exp_name
If you have already downloaded the datasets, you can use the flag --data_dir to point the code to the path where the data is stored.
By default, the evaluation script outputs to the same directory as $train_output_dir. This can be changed with the flag --output_dir on the evaluation script.
Note: This will not submit to our leaderboard unless you pass the --submit flag.
Submitting
To submit, you'll run the evaluate script with some extra flags.
The submission script uploads files (such as the model checkpoint and the file specifying the sample ids) to Hugging Face Hub, so you will need a Hugging Face account and a repository where these artifacts will be stored. To set these up, follow these steps:
- Make sure you have git-lfs installed (run git lfs install if not).
- Create a Hugging Face account at https://huggingface.co/join.
- Log in to your Hugging Face account: huggingface-cli login
- Create a repository where the data will be stored: huggingface-cli repo create <REPO_NAME> --type model
Once you're ready to submit, run the evaluation script with some extra flags, for example:
python evaluate.py \
--track=filtering \
--train_output_dir=$train_output_dir \
--samples=$sample_files \
--dataset-size=1234568 \
--submit \
--method_name="[your method name, please be descriptive!]" \
--author="[your name]" \
--email="[your email address]" \
--hf_username=$hf_username \
--hf_repo_name=$hf_repo_name
Please note that the name of your method and the authors (and no other information) will be made publicly available in our leaderboard. Be sure to replace all fields with the correct information.
If you have a paper or blog post and would like it to be linked on our leaderboard, you can add that information with the --writeup flag.
Important: We highly encourage users to specify the samples used to train the model using the --samples flag. This can be file(s) containing the uids of samples from our pool, and/or other files specifying the URLs and captions for images outside our pool. You can specify multiple files using the :: separator, for instance --samples=/path/to/sample_ids.npy::/path/to/custom_data.parquet.
We also highly encourage participants to upload the checkpoints for their trained models using the --upload-checkpoint flag.
Checkpoints
We release the checkpoints for our main baselines as part of OpenCLIP. More details can be found at https://github.com/mlfoundations/open_clip/blob/main/docs/datacomp_models.md.
Citation
If you found this repository, our paper or data useful, please consider citing:
@article{datacomp,
title={DataComp: In search of the next generation of multimodal datasets},
author={Samir Yitzhak Gadre and Gabriel Ilharco and Alex Fang and Jonathan Hayase and Georgios Smyrnis and Thao Nguyen and Ryan Marten and Mitchell Wortsman and Dhruba Ghosh and Jieyu Zhang and Eyal Orgad and Rahim Entezari and Giannis Daras and Sarah Pratt and Vivek Ramanujan and Yonatan Bitton and Kalyani Marathe and Stephen Mussmann and Richard Vencu and Mehdi Cherti and Ranjay Krishna and Pang Wei Koh and Olga Saukh and Alexander Ratner and Shuran Song and Hannaneh Hajishirzi and Ali Farhadi and Romain Beaumont and Sewoong Oh and Alex Dimakis and Jenia Jitsev and Yair Carmon and Vaishaal Shankar and Ludwig Schmidt},
journal={arXiv preprint arXiv:2304.14108},
year={2023}
}