• Stars
    star
    121
  • Rank 293,924 (Top 6 %)
  • Language
    Jupyter Notebook
  • Created almost 3 years ago
  • Updated over 1 year ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

Official pytorch implementation of paper "RSTT: Real-time Spatial Temporal Transformer for Space-Time Video Super-Resolution"

RSTT (Real-time Spatial Temporal Transformer)

PWC PWC PWC

This is the official pytorch implementation of the paper "RSTT: Real-time Spatial Temporal Transformer for Space-Time Video Super-Resolution"

Zhicheng Geng*, Luming Liang*, Tianyu Ding and Ilya Zharkov

IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022

Paper | Video

Introduction

Space-time video super-resolution (STVSR) is the task of interpolating videos with both Low Frame Rate (LFR) and Low Resolution (LR) to produce a High-Frame-Rate (HFR) and also High-Resolution (HR) counterpart. The existing methods based on Convolutional Neural Network (CNN) succeed in achieving visually satisfied results while suffer from slow inference speed due to their heavy architectures. We propose to resolve this issue by using a spatial-temporal transformer that naturally incorporates the spatial and temporal super resolution modules into a single model. Unlike CNN-based methods, we do not explicitly use separated building blocks for temporal interpolations and spatial super-resolutions; instead, we only use a single end-to-end transformer architecture. Specifically, a reusable dictionary is built by encoders based on the input LFR and LR frames, which is then utilized in the decoder part to synthesize the HFR and HR frames.

Performance

Below is performance of RSTT on Vid4 dataset using small (S), medium (M) and large (L) architectures compared to other baseline models. We plot FPS versus PSNR. Note that 24 FPS is the standard cinematic frame rate. We also plot the number of parameters (in millions) versus PSNR.

Architecture Overview

The features extracted from four input LFR and LR frames are processed by encoders Ek, k = 0, 1, 2, 3 to build dictionaries that will be used as inputs for the decoders Dk, k = 0, 1, 2, 3. The query builder generates a vector of queries Q which are then used to synthesize a sequence of seven consecutive HFR and HR frames.

Environment

Cuda 11.4

Python 3.8.11

torch 1.9.0 or higher

Installation

$ git clone https://github.com/llmpass/RSTT.git
$ pip install -r requirements.txt

Note that the torch version must be compatible to the cuda version, not necessary to be 1.9.0 here. For example, with cuda version 11.X, torch 1.9.0 is too old to use, may cause problems like

Cuda error: no kernel image is available for execution on the device 

Dataset Preparation

Download vimeo90k Septuplet dataset for training and evaluation:

http://toflow.csail.mit.edu/index.html#septuplet

Choose "The original training + test set (82GB)".

cp datasets/vimeo_septuplet/*.txt /path/to/vimeo/
python ./datasets/prepare_vimeo.py --path /path/to/vimeo/

Download Vid4 dataset for evaluation:

https://drive.google.com/drive/folders/10-gUO6zBeOpWEamrWKCtSkkUFukB9W5m

Train

Make sure writing a yml file with settings pointing to correct paths, for example:

python train.py --config ./configs/RSTT-S.yml

Evaluation

Vid4:

Make sure writing a yml file with settings pointing to correct paths, for example:

python eval_vid4.py --config ./configs/RSTT-S-eval-vid4.yml

Vimeo90k:

Make sure writing a yml file with settings pointing to correct paths, for example:

python eval_vimeo90k.py --config ./configs/RSTT-S-eval-vimeo90k.yml

Citation

@article{geng2022rstt,
  title={RSTT: Real-time Spatial Temporal Transformer for Space-Time Video Super-Resolution},
  author={Zhicheng Geng and Luming Liang and Tianyu Ding and Ilya Zharkov},
  journal={arXiv preprint arXiv:2203.14186},
  year={2022}
}

or

@inproceedings{geng2022rstt,
  title={RSTT: Real-time Spatial Temporal Transformer for Space-Time Video Super-Resolution},
  author={Zhicheng Geng and Luming Liang and Tianyu Ding and Ilya Zharkov},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={17441--17451},
  year={2022}
}

Acknowledgment

Our code is built on Zooming-Slow-Mo, EDVR, UFormer, and Swin-Transformer. We thank the authors for sharing their codes.

License

The code is released under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Public License for NonCommercial use only. Any commercial use should get formal permission first.