igorbrigadir/DownloadConceptualCaptions

Stars: 107
Rank: 312,776 (Top 7 %)
Language: Jupyter Notebook
License: MIT License
Created: over 5 years ago
Updated: about 3 years ago


Reliably download millions of images efficiently

Download Conceptual Captions Data

Download the data files from https://ai.google.com/research/ConceptualCaptions/download and place them in this folder:

Train_GCC-training.tsv Training Split (3,318,333)

Validation_GCC-1.1.0-Validation.tsv Validation Split (15,840)

The Test Split (~12,500 human-approved image-caption pairs) is not public.
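For orientation, here is a minimal sketch of reading one of these files. It assumes each line holds a caption and an image URL separated by a tab, with no header row. The helper name read_caption_url_pairs is hypothetical and is not taken from download_data.py.

```python
import csv


def read_caption_url_pairs(tsv_path):
    """Yield (caption, url) tuples from a headerless two-column TSV file.

    Assumption: each line is "caption<TAB>url".
    """
    with open(tsv_path, newline="", encoding="utf-8") as handle:
        # QUOTE_NONE treats quote characters inside captions as plain text.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            if len(row) != 2:
                continue  # skip malformed lines rather than guess at their meaning
            caption, url = row
            yield caption, url
```

Something like `sum(1 for _ in read_caption_url_pairs("Validation_GCC-1.1.0-Validation.tsv"))` could then be compared against the split sizes listed above.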

Run download_data.py.

Images will be saved in the training and validation folders. You can stop and resume the download. The settings for splitting downloads into chunks and threads are not optimal, but they maxed out my connection, so I kept them as is.
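The general idea of chunked, threaded, resumable downloading can be sketched as follows. This is not the code in download_data.py: the file naming by row index, the chunk size, the worker count, and the User-Agent string are all illustrative assumptions. The fetch function is passed in so the logic can be tried without network access.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from urllib.request import Request, urlopen


def fetch_bytes(url, timeout=20):
    """Download one URL. The User-Agent value is an arbitrary placeholder."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0 (example)"})
    with urlopen(request, timeout=timeout) as response:
        return response.read()


def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def download_one(index, url, out_dir, fetch):
    """Fetch one URL unless its target file already exists (this enables resuming)."""
    target = os.path.join(out_dir, f"{index:08d}")  # hypothetical naming scheme
    if os.path.exists(target):
        return index, "skipped"
    try:
        data = fetch(url)
    except Exception as error:  # record the failure and keep going
        return index, f"error: {error}"
    partial = target + ".part"
    with open(partial, "wb") as handle:
        handle.write(data)
    os.replace(partial, target)  # only complete files get the final name
    return index, "ok"


def download_all(urls, out_dir, fetch=fetch_bytes, chunk_size=1000, workers=16):
    """Download URLs chunk by chunk, using a thread pool within each chunk."""
    os.makedirs(out_dir, exist_ok=True)
    results = {}
    for chunk in chunked(list(enumerate(urls)), chunk_size):
        with ThreadPoolExecutor(max_workers=workers) as pool:
            futures = [pool.submit(download_one, i, u, out_dir, fetch) for i, u in chunk]
            for future in futures:
                index, status = future.result()
                results[index] = status
    return results
```

Writing to a ".part" file first means an interrupted run never leaves a truncated file under its final name, so a rerun can safely skip anything that already exists.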

Note: A previous version of this script used a different file naming scheme. If you are resuming a download started with that version, you will get duplicates.
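If duplicates do end up in a folder, one possible way to find them is to group files by a hash of their contents. The sketch below assumes duplicates are byte-identical copies saved under different names. The function name find_duplicate_files is hypothetical and not part of this repository.

```python
import hashlib
import os
from collections import defaultdict


def find_duplicate_files(folder):
    """Return lists of file names in `folder` whose contents are byte-identical."""
    by_digest = defaultdict(list)
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if not os.path.isfile(path):
            continue
        with open(path, "rb") as handle:
            # Reads the whole file at once; fine for typical image sizes.
            digest = hashlib.sha256(handle.read()).hexdigest()
        by_digest[digest].append(name)
    return [names for names in by_digest.values() if len(names) > 1]
```

Each returned group lists names that share identical content, so all but one file per group could be removed.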

Many images will fail to download, and some URLs will return web pages instead of images. These will need to be cleaned up later. After the download finishes, see downloaded_validation_report.tsv for HTTP errors. Based on the validation set results, around 8% of the images are gone. Setting the user agent might fix some errors too, though I am not sure whether any sites reject requests based on it.
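One way to spot files that are really web pages is to check the first few bytes against well-known image signatures. The sketch below covers only JPEG, PNG, GIF, and WebP, which is an assumption about which formats matter here. The helper names are hypothetical and are not part of download_data.py.

```python
import os

# Leading bytes ("magic numbers") of common image formats.
IMAGE_SIGNATURES = {
    b"\xff\xd8\xff": "jpeg",
    b"\x89PNG\r\n\x1a\n": "png",
    b"GIF87a": "gif",
    b"GIF89a": "gif",
}


def sniff_image_type(header):
    """Return a format name if `header` starts like a known image, else None."""
    for signature, kind in IMAGE_SIGNATURES.items():
        if header.startswith(signature):
            return kind
    # WebP files start with "RIFF", four size bytes, then "WEBP".
    if header[:4] == b"RIFF" and header[8:12] == b"WEBP":
        return "webp"
    return None


def list_non_images(folder):
    """Return names of files in `folder` that do not start like a known image."""
    suspects = []
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if not os.path.isfile(path):
            continue
        with open(path, "rb") as handle:
            header = handle.read(16)
        if sniff_image_type(header) is None:
            suspects.append(name)
    return suspects
```

Files reported by list_non_images are candidates for deletion, but it is worth inspecting a few by hand first, since less common image formats would also be flagged.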

It should take about a day or two to download the training data. Keep an eye on disk space.
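A small check like the following could be run periodically to watch free disk space during a long download. The 50 GiB default is an arbitrary placeholder, not a measured requirement for this dataset.

```python
import shutil


def free_gib(path="."):
    """Return free space, in GiB, on the filesystem that contains `path`."""
    return shutil.disk_usage(path).free / 2**30


def enough_space(path=".", min_free_gib=50.0):
    """Return True if at least `min_free_gib` GiB are free (threshold is a placeholder)."""
    return free_gib(path) >= min_free_gib
```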
