RSTT (Real-time Spatial Temporal Transformer)
This is the official PyTorch implementation of the paper "RSTT: Real-time Spatial Temporal Transformer for Space-Time Video Super-Resolution".
Zhicheng Geng*, Luming Liang*, Tianyu Ding and Ilya Zharkov
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022
Introduction
Space-time video super-resolution (STVSR) is the task of interpolating videos with both Low Frame Rate (LFR) and Low Resolution (LR) to produce High-Frame-Rate (HFR) and High-Resolution (HR) counterparts. Existing methods based on Convolutional Neural Networks (CNNs) achieve visually satisfying results but suffer from slow inference speed due to their heavy architectures. We propose to resolve this issue with a spatial-temporal transformer that naturally incorporates the spatial and temporal super-resolution modules into a single model. Unlike CNN-based methods, we do not use separate building blocks for temporal interpolation and spatial super-resolution; instead, we use a single end-to-end transformer architecture. Specifically, encoders build a reusable dictionary from the input LFR and LR frames, which the decoder then uses to synthesize the HFR and HR frames.
Performance
Below is the performance of RSTT on the Vid4 dataset using small (S), medium (M) and large (L) architectures, compared to other baseline models. We plot FPS versus PSNR. Note that 24 FPS is the standard cinematic frame rate. We also plot the number of parameters (in millions) versus PSNR.
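For readers unfamiliar with the quality metric, PSNR can be sketched as follows. This is a minimal pure-Python illustration of the standard definition, not the evaluation code of this repository; the function name `psnr`, the flat pixel lists, and the default peak value of 255 are assumptions made for the example, and the README does not state which color channels the reported numbers are computed on.

```python
import math


def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio (dB) between two equally sized pixel sequences.

    `reference` and `estimate` are flat sequences of pixel intensities;
    `max_value` is the largest possible intensity (255 for 8-bit images).
    """
    if len(reference) != len(estimate) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical inputs
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Higher values mean the estimate is closer to the reference, and identical inputs give infinity.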
Architecture Overview
The features extracted from four input LFR and LR frames are processed by encoders Ek, k = 0, 1, 2, 3 to build dictionaries that will be used as inputs for the decoders Dk, k = 0, 1, 2, 3. The query builder generates a vector of queries Q which are then used to synthesize a sequence of seven consecutive HFR and HR frames.
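The frame counts above follow a simple pattern, sketched below. The sketch assumes that the seven output frames are the four input time steps plus one interpolated frame between each adjacent pair; the README only states "four in, seven out". The function names are hypothetical and are not part of this repository.

```python
def output_frame_count(num_input_frames):
    """Number of output frames when one frame is added between each adjacent input pair."""
    if num_input_frames < 2:
        raise ValueError("need at least two input frames to interpolate between")
    return 2 * num_input_frames - 1


def output_timestamps(num_input_frames):
    """Output frame times, in units of the input frame interval.

    Whole numbers coincide with input frames; halves are interpolated frames
    (under the assumption stated above).
    """
    return [i / 2 for i in range(output_frame_count(num_input_frames))]
```

With four input frames, `output_frame_count(4)` gives 7, and the timestamps run from 0.0 to 3.0 in steps of 0.5.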
Environment
CUDA 11.4
Python 3.8.11
torch 1.9.0 or higher
Installation
$ git clone https://github.com/llmpass/RSTT.git
$ cd RSTT
$ pip install -r requirements.txt
Note that the torch version must be compatible with the CUDA version; it does not have to be 1.9.0. For example, with CUDA 11.X, torch 1.9.0 may be too old to use and can cause errors such as:
CUDA error: no kernel image is available for execution on the device
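To diagnose this kind of mismatch, it can help to print what the installed torch build was compiled for and what the GPU reports. The following is a minimal sketch that uses only standard `torch` attributes; the helper name `describe_torch_cuda` is hypothetical and is not part of this repository.

```python
def describe_torch_cuda():
    """Collect CUDA-related facts about the installed torch build.

    Returns {"torch": None} if torch is not installed.
    """
    try:
        import torch
    except ImportError:
        return {"torch": None}

    info = {
        "torch": torch.__version__,
        "built_with_cuda": torch.version.cuda,  # None for CPU-only builds
        "cuda_available": torch.cuda.is_available(),
    }
    if info["cuda_available"]:
        # Compute capability of GPU 0, e.g. (8, 6) -> "sm_86".
        major, minor = torch.cuda.get_device_capability(0)
        info["device_capability"] = f"sm_{major}{minor}"
        # Architectures this torch binary was compiled for.
        info["compiled_archs"] = torch.cuda.get_arch_list()
    return info


if __name__ == "__main__":
    for key, value in describe_torch_cuda().items():
        print(f"{key}: {value}")
```

If the device capability is not covered by the compiled architectures, installing a torch build that matches the local CUDA version and GPU is the likely fix.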
Dataset Preparation
Download the Vimeo90k Septuplet dataset for training and evaluation:
http://toflow.csail.mit.edu/index.html#septuplet
Choose "The original training + test set (82GB)".
cp datasets/vimeo_septuplet/*.txt /path/to/vimeo/
python ./datasets/prepare_vimeo.py --path /path/to/vimeo/
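Before running the preparation script, a quick check of the dataset folder can catch a wrong path early. The sketch below assumes the extracted dataset contains a `sequences` subfolder and that the copied list files end in `.txt`; both are assumptions about the layout, and `check_vimeo_root` is a hypothetical helper, not part of this repository.

```python
from pathlib import Path


def check_vimeo_root(root):
    """Return a list of human-readable problems found under `root` (empty if none)."""
    root = Path(root)
    if not root.is_dir():
        return [f"{root} is not a directory"]
    problems = []
    # Assumed layout: frames live under <root>/sequences.
    if not (root / "sequences").is_dir():
        problems.append("missing 'sequences' folder")
    # The list files copied from datasets/vimeo_septuplet/ should sit in <root>.
    if not any(root.glob("*.txt")):
        problems.append("no .txt list files found")
    return problems
```

An empty return value means the folder matches the assumed layout; it does not verify that the download itself is complete.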
Download Vid4 dataset for evaluation:
https://drive.google.com/drive/folders/10-gUO6zBeOpWEamrWKCtSkkUFukB9W5m
Train
Make sure to write a yml file with settings pointing to the correct paths, for example:
python train.py --config ./configs/RSTT-S.yml
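Before launching a long run, it can help to confirm that every path in the config exists. The sketch below walks an already-loaded config dictionary and reports string values that look like paths but are missing on disk. The helper name `find_missing_paths` is hypothetical and is not part of this repository, and the "contains a path separator" heuristic is an assumption because the README does not list the config keys.

```python
import os


def find_missing_paths(config, prefix=""):
    """Return (dotted_key, value) pairs for path-like strings that do not exist."""
    missing = []
    for key, value in config.items():
        name = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict):
            # Recurse into nested sections of the config.
            missing.extend(find_missing_paths(value, name))
        elif isinstance(value, str) and ("/" in value or os.sep in value):
            # Heuristic: treat strings containing a separator as paths.
            if not os.path.exists(value):
                missing.append((name, value))
    return missing
```

With PyYAML installed, the dictionary can be obtained via `yaml.safe_load` on the opened config file. Output paths that the training script creates itself will also be reported, so treat the result as a hint rather than an error.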
Evaluation
Vid4:
Make sure to write a yml file with settings pointing to the correct paths, for example:
python eval_vid4.py --config ./configs/RSTT-S-eval-vid4.yml
Vimeo90k:
Make sure to write a yml file with settings pointing to the correct paths, for example:
python eval_vimeo90k.py --config ./configs/RSTT-S-eval-vimeo90k.yml
Citation
@article{geng2022rstt,
title={RSTT: Real-time Spatial Temporal Transformer for Space-Time Video Super-Resolution},
author={Zhicheng Geng and Luming Liang and Tianyu Ding and Ilya Zharkov},
journal={arXiv preprint arXiv:2203.14186},
year={2022}
}
or
@inproceedings{geng2022rstt,
title={RSTT: Real-time Spatial Temporal Transformer for Space-Time Video Super-Resolution},
author={Zhicheng Geng and Luming Liang and Tianyu Ding and Ilya Zharkov},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={17441--17451},
year={2022}
}
Acknowledgment
Our code is built on Zooming-Slow-Mo, EDVR, UFormer, and Swin-Transformer. We thank the authors for sharing their code.
License
The code is released under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Public License for non-commercial use only. Any commercial use requires formal permission first.