  • Stars
    107
  • Rank 312,776 (Top 7 %)
  • Language
    Jupyter Notebook
  • License
    MIT License
  • Created over 5 years ago
  • Updated about 3 years ago

Repository Details

Reliably download millions of images efficiently

Download Conceptual Captions Data

Download the data files from https://ai.google.com/research/ConceptualCaptions/download and place them in this folder:

Train_GCC-training.tsv: Training Split (3,318,333 image-caption pairs)

Validation_GCC-1.1.0-Validation.tsv: Validation Split (15,840 image-caption pairs)

The Test Split (~12,500 human-approved image-caption pairs) is not public.
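As a rough illustration of what the script has to work with, the sketch below reads one of these TSV files. It assumes each row holds a caption followed by an image URL, separated by a tab, with no header row; check the files you downloaded before relying on that layout. `read_pairs` is a hypothetical helper and is not part of download_data.py.

```python
import csv
from typing import Iterator, Tuple


def read_pairs(tsv_path: str) -> Iterator[Tuple[str, str]]:
    """Yield (caption, url) pairs from a tab-separated file.

    Assumption: two columns per row (caption, then URL) and no header row.
    """
    with open(tsv_path, newline="", encoding="utf-8") as f:
        # QUOTE_NONE keeps quotation marks inside captions as plain text.
        reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            if len(row) < 2:
                continue  # skip empty or malformed rows
            yield row[0], row[1]
```

For example, `for caption, url in read_pairs("Validation_GCC-1.1.0-Validation.tsv"): ...` would walk the validation split one pair at a time without loading the whole file into memory.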

Run download_data.py.

Images are saved to the training and validation folders. You can stop the script and resume it later. The settings for splitting downloads into chunks and threads are not optimal, but they maxed out my connection, so I kept them as is.
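The stop-and-resume behaviour with threaded downloads can be sketched roughly as follows. This is not the code in download_data.py: the function names, the index-based file names, the thread count, and the `fetcher` parameter are all assumptions made for illustration. The idea is that a file that already exists is skipped, so a restarted run only fetches what is missing.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, Tuple
from urllib.request import urlopen


def fetch(url: str) -> bytes:
    """Download one URL and return the raw response body."""
    with urlopen(url, timeout=10) as resp:
        return resp.read()


def download_one(index: int, url: str, out_dir: str,
                 fetcher: Callable[[str], bytes]) -> Tuple[int, str]:
    """Save one file, skipping it if an earlier run already saved it."""
    path = os.path.join(out_dir, f"{index:08d}")  # hypothetical naming scheme
    if os.path.exists(path):
        return index, "skipped"
    try:
        data = fetcher(url)
    except Exception as exc:  # HTTP errors, timeouts, DNS failures
        return index, f"error: {exc}"
    tmp = path + ".part"
    with open(tmp, "wb") as f:
        f.write(data)
    os.replace(tmp, path)  # rename last, so an interrupted write is never mistaken for a finished file
    return index, "ok"


def download_all(urls: Sequence[str], out_dir: str, threads: int = 16,
                 fetcher: Callable[[str], bytes] = fetch) -> List[Tuple[int, str]]:
    """Download every URL with a thread pool and return (index, status) pairs."""
    os.makedirs(out_dir, exist_ok=True)
    with ThreadPoolExecutor(max_workers=threads) as pool:
        futures = [pool.submit(download_one, i, u, out_dir, fetcher)
                   for i, u in enumerate(urls)]
        return [f.result() for f in futures]
```

For millions of URLs you would likely feed `download_all` one chunk of the list at a time rather than submitting everything at once, which keeps memory use bounded.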

Note: A previous version of this script used a different file naming scheme. If you are resuming a download that was started with that version, you will get duplicates.

A number of URLs will fail to download or will return web pages instead of images. These will need to be cleaned up later. After the download finishes, see downloaded_validation_report.tsv for HTTP errors. Based on the validation set results, around 8% of images are gone. Setting the user agent might also fix some errors; I am not sure whether any sites reject requests based on it.
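One possible way to do the cleanup is to check the first bytes of each saved file against well-known image signatures, which flags HTML error pages saved in place of images. The sketch below is an assumption about how you might approach it, not something the script does; `looks_like_image`, `find_non_images`, and `make_request` are hypothetical helpers, and the user agent string is a placeholder.

```python
import os
from typing import List
from urllib.request import Request

# Leading bytes ("magic numbers") of common image formats.
IMAGE_SIGNATURES = (
    b"\xff\xd8\xff",       # JPEG
    b"\x89PNG\r\n\x1a\n",  # PNG
    b"GIF87a",             # GIF
    b"GIF89a",             # GIF
)


def looks_like_image(head: bytes) -> bool:
    """Return True if the leading bytes match a known image signature."""
    if head.startswith(IMAGE_SIGNATURES):
        return True
    # WebP files start with "RIFF", a 4-byte size, then "WEBP".
    return head[:4] == b"RIFF" and head[8:12] == b"WEBP"


def find_non_images(folder: str) -> List[str]:
    """List files in a folder whose contents do not start like an image."""
    bad = []
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if not os.path.isfile(path):
            continue
        with open(path, "rb") as f:
            head = f.read(12)
        if not looks_like_image(head):
            bad.append(path)
    return bad


def make_request(url: str,
                 user_agent: str = "Mozilla/5.0 (compatible; example-downloader)") -> Request:
    """Build a request with an explicit User-Agent header (placeholder value)."""
    return Request(url, headers={"User-Agent": user_agent})
```

Running `find_non_images("validation")` would give you a list of candidates to inspect or delete. The signature check only covers the formats listed, so anything it flags is worth a quick look before removal.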

Downloading the training data should take about a day or two. Keep an eye on disk space.
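If you would rather not watch disk space by hand, a small check like the one below could be called periodically during a long run. It is only a sketch: `has_free_space` is a hypothetical helper, and the 50 GB default is an arbitrary placeholder rather than a measured requirement for this dataset.

```python
import shutil


def has_free_space(path: str, min_free_gb: float = 50.0) -> bool:
    """Return True if the filesystem holding `path` has at least min_free_gb free."""
    free_gb = shutil.disk_usage(path).free / 1024 ** 3
    return free_gb >= min_free_gb
```

A download loop could pause or exit cleanly when `has_free_space("training")` returns False, which is easier to recover from than a run that dies on a full disk.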
