Discover antoninodimaggio/Hugging-Captions Open Source project

Introduction

Hugging Captions fine-tunes GPT-2, a transformer-based language model by OpenAI, to generate realistic photo captions. All of the transformer stuff is implemented using Hugging Face's Transformers library, hence the name Hugging Captions.

Setup

Required

Python 3.6 +
CUDA 10.2 (Instructions for installing PyTorch on 9.2 or 10.1)

git clone https://github.com/antoninodimaggio/Hugging-Captions.git
cd Hugging-Captions
pip install -r requirements.txt

Download Training Data

It is important that you choose a hashtag that has more than 10,000 posts and is relevant to the photo you want to generate a caption for
Detailed information on each argument can be found here
You could also use python python download.py -h for help

python download.py --tag shibainu \
    --caption-queries 60 \
    --min-likes 10

Training and Generating Captions

Train

Now that we have our training data we can train (fine-tune) our transformer-based language model. The model will train fast on a decent GPU.

python tune_transformer.py --tag shibainu --train

Generate Captions

The most important argument is --prompt, you want too lead your model in the right direction, the more specific the better.
Detailed information on each argument can be found here
You could also use python tune_transformer.py -h for help

python tune_transformer.py --tag shibainu --generate \
    --prompt Adorable\ smile
    --max-length 60 \
    --min-length 20 \
    --num-captions 40

Train and Generate Captions

Trains and generates captions all in one go

python tune_transformer.py --tag shibainu --train --generate \
    --prompt Adorable\ smile
    --max-length 60 \
    --min-length 20 \
    --num-captions 40

See Your Results

Navigate to /Hugging-Captions/text/generated_text/<tag>_gen.txt to look at your generated captions

My Results Are Not What I Expected

Some of the generated captions are going to be ugly. Some of the generated captions are going to be really good but a word or two simply does not make sense. This is expected no matter how much the data, both training and generated, is cleaned. If you are not getting the results that you want I have four suggestions.

Choose a better hashtag. If you are captioning a photo of a dog do not choose #dog instead try #poodle, #bulldog, and so on.
Make your prompt more specific. A prompt like "My day" is very general and will lead to general results, instead try something like "My Saturday morning".
Increase your number of captions. The default is 40, bump that up to 80.
Increase the number of caption queries. The default is 60, raise that to say 100.

Future Work

Explore ways to better clean caption data both generated and training
Explore different pre-trained language models
Fine-tune models using caption data from multiple relevant hashtags

antoninodimaggio/Hugging-Captions

antoninodimaggio

Reviews

Repository Details