# object-localization
This project shows how to localize objects in images by using simple convolutional neural networks.
## Dataset
Before getting started, we have to download a dataset and generate a CSV file containing the annotations (bounding boxes).

- Download The Oxford-IIIT Pet Dataset
- Download The Oxford-IIIT Pet Dataset Annotations
- `tar xf images.tar.gz`
- `tar xf annotations.tar.gz`
- `mv annotations/xmls/* images/`
- `python3 generate_dataset.py`
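The conversion itself happens in the repository's `generate_dataset.py`. As a rough, hypothetical sketch of what such a step involves (the output file name and CSV column order are assumptions, not the script's actual format), the Pascal VOC-style XML annotations are parsed into one CSV row per image:

```python
import csv
import glob
import xml.etree.ElementTree as ET

# Minimal sketch of an annotation converter: read each Pascal VOC XML file
# and write one CSV row per image with its bounding box coordinates.
with open("dataset.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for xml_path in glob.glob("images/*.xml"):
        root = ET.parse(xml_path).getroot()
        filename = root.find("filename").text
        box = root.find("object/bndbox")
        writer.writerow([
            filename,
            int(box.find("xmin").text),
            int(box.find("ymin").text),
            int(box.find("xmax").text),
            int(box.find("ymax").text),
        ])
```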
## Single-object detection
### Example 1: Finding dogs/cats
#### Architecture
First, let's look at YOLOv2's approach:
- Pretrain Darknet-19 on ImageNet (feature extractor)
- Remove the last convolutional layer
- Add three 3 x 3 convolutional layers with 1024 filters
- Add a 1 x 1 convolutional layer with the number of outputs needed for detection
We proceed in the same way to build the object detector:
- Choose a model from Keras Applications, i.e. a feature extractor
- Remove the dense layer
- Freeze some/all/no layers
- Add one/multiple/no convolution blocks (or `_inverted_res_block` for MobileNetv2)
- Add a convolution layer for the coordinates
The code in this repository uses MobileNetv2 because it is faster than other models and its performance can be scaled: for example, if `alpha = 0.35` with a 96x96 input is not good enough, one can simply increase both values (see here for a comparison). If you use another architecture, change `preprocess_input` accordingly.
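As a minimal sketch of these steps (assuming TensorFlow 2.x Keras; the exact head and layer choices in the repository's `example_1/train.py` may differ):

```python
from tensorflow.keras.applications import MobileNetV2
from tensorflow.keras.layers import Conv2D, Reshape
from tensorflow.keras.models import Model

IMAGE_SIZE = 96
ALPHA = 0.35

def create_model(trainable=False):
    # Feature extractor from Keras Applications, without the dense classifier.
    base = MobileNetV2(input_shape=(IMAGE_SIZE, IMAGE_SIZE, 3),
                       include_top=False, alpha=ALPHA, weights="imagenet")
    for layer in base.layers:
        layer.trainable = trainable  # freeze some/all/no layers

    # A final convolution produces the 4 box coordinates.
    x = Conv2D(4, kernel_size=3)(base.output)  # 3x3 feature map -> 1x1x4
    x = Reshape((4,))(x)
    return Model(inputs=base.input, outputs=x)
```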
1. `python3 example_1/train.py`
2. Adjust `WEIGHTS_FILE` in `example_1/test.py` (the checkpoint file name is printed by the training script)
3. `python3 example_1/test.py`
#### Result
In the following images, red is the predicted box and green is the ground truth:
### Example 2: Finding dogs/cats and distinguishing classes
This time we have to run the scripts `example_2/train.py` and `example_2/test.py`.
#### Changes
In order to distinguish between classes, we have to modify the loss function. I'm using here `w_1 * log((y_hat - y)^2 + 1) + w_2 * FL(p_hat, p)` where `w_1 = w_2 = 1` are two weights and `FL(p_hat, p) = -(0.9 * (1 - p_hat)^2 * p * log(p_hat) + 0.1 * p_hat^2 * (1 - p) * log(1 - p_hat))` is the focal loss.
Instead of using all 37 classes, the code will only output class 0 (the image contains class 0) or class 1 (the image contains one of the classes 1 to 36). However, it is easy to extend this to more classes: use categorical cross-entropy instead of focal loss and try out different weights.
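A possible implementation of this loss with Keras backend ops (the function names and the epsilon clipping are my additions; the repository's `example_2/train.py` is the reference):

```python
from tensorflow.keras import backend as K

# Sketch of the loss described above, with w_1 = w_2 = 1.
# y/y_hat are the box coordinates, p/p_hat the class probability.

def focal_loss(p, p_hat, eps=1e-7):
    p_hat = K.clip(p_hat, eps, 1.0 - eps)  # avoid log(0)
    return -(0.9 * K.pow(1.0 - p_hat, 2) * p * K.log(p_hat)
             + 0.1 * K.pow(p_hat, 2) * (1.0 - p) * K.log(1.0 - p_hat))

def detection_loss(y, y_hat, p, p_hat, w1=1.0, w2=1.0):
    box_loss = K.sum(K.log(K.square(y_hat - y) + 1.0), axis=-1)
    return w1 * box_loss + w2 * focal_loss(p, p_hat)
```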
## Multi-object detection
### Example 3: Segmentation-like detection
#### Architecture
In this example, we use a skip-net architecture similar to U-Net. For an in-depth explanation see my blog post.
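For illustration only, a toy skip-net in Keras: the upsampling path concatenates feature maps from the downsampling path, U-Net style (the sizes and filter counts here are made up, not the repository's architecture):

```python
from tensorflow.keras.layers import (Concatenate, Conv2D, Input,
                                     MaxPooling2D, UpSampling2D)
from tensorflow.keras.models import Model

inputs = Input(shape=(128, 128, 3))
d1 = Conv2D(16, 3, padding="same", activation="relu")(inputs)              # 128x128
d2 = Conv2D(32, 3, padding="same", activation="relu")(MaxPooling2D()(d1))  # 64x64
b = Conv2D(64, 3, padding="same", activation="relu")(MaxPooling2D()(d2))   # 32x32

u2 = Concatenate()([UpSampling2D()(b), d2])   # skip connection from d2
u2 = Conv2D(32, 3, padding="same", activation="relu")(u2)
u1 = Concatenate()([UpSampling2D()(u2), d1])  # skip connection from d1
outputs = Conv2D(1, 1, activation="sigmoid")(u1)  # per-pixel objectness map

model = Model(inputs, outputs)
```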
#### Result
### Example 4: YOLO-like detection
#### Architecture
This example is based on the three YOLO papers. For an in-depth explanation see this blog post.
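As a bare illustration of the idea (not the repository's model): the feature extractor's final feature map becomes a coarse grid, and a convolution gives each cell an objectness score; the full version adds box offsets and class probabilities per cell.

```python
from tensorflow.keras.applications import MobileNetV2
from tensorflow.keras.layers import Conv2D
from tensorflow.keras.models import Model

# Toy YOLO-like head: each of the 7x7 grid cells predicts one score that
# says whether an object's center falls into that cell.
base = MobileNetV2(input_shape=(224, 224, 3), include_top=False, weights="imagenet")
grid = Conv2D(1, kernel_size=1, activation="sigmoid")(base.output)  # 7x7x1
model = Model(base.input, grid)
```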
#### Result
## Guidelines
### Improve accuracy (IoU)
- enable augmentations: see `example_4`; the same code can be added to the other examples (a minimal flip sketch follows this list)
- better augmentations: try out different values (flips, rotation etc.)
- for MobileNetv1/2: increase `ALPHA` and `IMAGE_SIZE` in `train_model.py`
- other architectures: increase `IMAGE_SIZE`
- add more layers
- try out other loss functions (MAE, smooth L1 loss etc.)
- other optimizer: SGD with momentum 0.9, adjust learning rate
- use a feature pyramid
- read keras-team/keras#9965
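When augmenting, the bounding box has to be transformed together with the image; e.g. a horizontal flip mirrors the x coordinates. A plain NumPy sketch (not the repository's augmentation code):

```python
import numpy as np

def flip_horizontal(image, box):
    """Flip an HxWxC image and its (xmin, ymin, xmax, ymax) box."""
    width = image.shape[1]
    xmin, ymin, xmax, ymax = box
    # xmin and xmax swap roles after mirroring the x axis.
    return image[:, ::-1], (width - xmax, ymin, width - xmin, ymax)

image = np.zeros((96, 96, 3))
flipped, new_box = flip_horizontal(image, (10, 20, 40, 60))
```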
### Increase training speed
- increase `BATCH_SIZE`
- use fewer layers, a smaller `IMAGE_SIZE` and a smaller `ALPHA`
### Overfitting
- If the new dataset is small and similar to ImageNet, freeze all layers.
- If the new dataset is small and not similar to ImageNet, freeze some layers.
- If the new dataset is large, freeze no layers.
- read http://cs231n.github.io/transfer-learning/
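Freezing simply means marking layers as non-trainable before compiling. A sketch (the cut-off `N` is hypothetical and depends on your dataset, as described above):

```python
from tensorflow.keras.applications import MobileNetV2

base = MobileNetV2(include_top=False, weights="imagenet")

N = 100  # freeze the first N layers; use len(base.layers) to freeze all, 0 for none
for layer in base.layers[:N]:
    layer.trainable = False
```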