Understanding Deformable Convolution
Keras / TensorFlow implementation of deformable convolution.
Dai, Jifeng, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. 2017. “Deformable Convolutional Networks.” arXiv [cs.CV]. arXiv. http://arxiv.org/abs/1703.06211
Check out https://medium.com/@phelixlau/notes-on-deformable-convolutional-networks-baaabbc11cf3 for my summary of the paper.
Experiment on MNIST and Scaled Data Augmentation
To demonstrate the effectiveness of deformable convolution with scaled images, we show that by simply replacing regular convolution with deformable convolution and fine-tuning just the offsets with a scale-augmented datasets, deformable CNN performs significantly better than regular CNN on the scaled MNIST dataset. This indicates that deformable convolution is able to more effectively utilize already learned feature map to represent geometric distortion.
First, we train a 4-layer CNN with regular convolution on MNIST without any data augmentation. Then we replace all regular convolution layers with deformable convolution layers and freeze the weights of all layers except the newly added convolution layers responsible for learning the offsets. This model is then fine-tuned on the scale-augmented MNIST dataset.
In this set up, the deformable CNN is forced to make better use of the learned feature map by only changing its receptive field.
Note that the deformable CNN did not receive additional supervision other than the labels and is trained with cross-entropy just like the regular CNN.
Test Accuracy | Regular CNN | Deformable CNN |
---|---|---|
Regular MNIST | 98.74% | 97.27% |
Scaled MNIST | 57.01% | 92.55% |
Please refer to scripts/scaled_mnist.py
for reproducing this result.
Notes on Implementation
- This implementation is not efficient. In fact a forward pass with deformed convolution takes 260 ms, while regular convolution takes only 10 ms. Also, GPU average utilization is only around 10%.
- This implementation also does not take advantage of the fact that offsets and
the input have similar shape (in
tf_batch_map_offsets
). (So STN-style bilinear sampling will help) - The TensorFlow Keras backend must be used