RAD-NeRF: Real-time Neural Talking Portrait Synthesis
This repository contains a PyTorch re-implementation of the paper: Real-time Neural Radiance Talking Portrait Synthesis via Audio-spatial Decomposition.
Project Page | Arxiv | Data
A GUI for easy visualization:
obama.mp4
Install
Tested on Ubuntu 22.04, PyTorch 1.12 and CUDA 11.6.
git clone https://github.com/ashawkey/RAD-NeRF.git
cd RAD-NeRF
Install dependency
# for ubuntu, portaudio is needed for pyaudio to work.
sudo apt install portaudio19-dev
pip install -r requirements.txt
Build extension (optional)
By default, we use load (PyTorch's torch.utils.cpp_extension.load) to build the extensions at runtime.
This can be inconvenient, since compilation is triggered when the code is first run.
Therefore, we also provide setup.py to build each extension ahead of time:
# install all extension modules
bash scripts/install_ext.sh
Data pre-processing
Preparation:
## install pytorch3d
pip install "git+https://github.com/facebookresearch/pytorch3d.git"
## prepare face-parsing model
wget https://github.com/YudongGuo/AD-NeRF/blob/master/data_util/face_parsing/79999_iter.pth?raw=true -O data_utils/face_parsing/79999_iter.pth
## prepare basel face model
# 1. download `01_MorphableModel.mat` from https://faces.dmi.unibas.ch/bfm/main.php?nav=1-2&id=downloads and put it under `data_utils/face_tracking/3DMM/`
# 2. download other necessary files from AD-NeRF's repository:
wget https://github.com/YudongGuo/AD-NeRF/blob/master/data_util/face_tracking/3DMM/exp_info.npy?raw=true -O data_utils/face_tracking/3DMM/exp_info.npy
wget https://github.com/YudongGuo/AD-NeRF/blob/master/data_util/face_tracking/3DMM/keys_info.npy?raw=true -O data_utils/face_tracking/3DMM/keys_info.npy
wget https://github.com/YudongGuo/AD-NeRF/blob/master/data_util/face_tracking/3DMM/sub_mesh.obj?raw=true -O data_utils/face_tracking/3DMM/sub_mesh.obj
wget https://github.com/YudongGuo/AD-NeRF/blob/master/data_util/face_tracking/3DMM/topology_info.npy?raw=true -O data_utils/face_tracking/3DMM/topology_info.npy
# 3. run convert_BFM.py
cd data_utils/face_tracking
python convert_BFM.py
cd ../..
## prepare ASR model
# if you want to use DeepSpeech as in AD-NeRF, you need to install tensorflow 1.15 manually.
# otherwise, Wav2Vec in PyTorch is also supported.
Pre-processing Custom Training Video
- Put training video under data/<ID>/<ID>.mp4. The video must be 25FPS, with all frames containing the talking person. The resolution should be about 512x512, and duration about 1-5min.
# an example training video from AD-NeRF
mkdir -p data/obama
wget https://github.com/YudongGuo/AD-NeRF/blob/master/dataset/vids/Obama.mp4?raw=true -O data/obama/obama.mp4
- Run script (may take hours depending on the video length)
# run all steps
python data_utils/process.py data/<ID>/<ID>.mp4
# if you want to run a specific step
python data_utils/process.py data/<ID>/<ID>.mp4 --task 1 # extract audio wave
- File structure after finishing all steps:
./data/<ID>
├──<ID>.mp4 # original video
├──ori_imgs # original images from video
│  ├──0.jpg
│  ├──0.lms # 2D landmarks
│  └──...
├──gt_imgs # ground truth images (static background)
│  ├──0.jpg
│  └──...
├──parsing # semantic segmentation
│  ├──0.png
│  └──...
├──torso_imgs # inpainted torso images
│  ├──0.png
│  └──...
├──aud.wav # original audio
├──aud_eo.npy # audio features (wav2vec)
├──aud.npy # audio features (deepspeech)
├──bc.jpg # default background
├──track_params.pt # raw head tracking results
├──transforms_train.json # head poses (train split)
└──transforms_val.json # head poses (test split)
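Since pre-processing can run for hours, it may help to confirm that every expected output exists before starting training. The following is a minimal sketch, not part of the repository: the helper name `missing_outputs` is hypothetical, the expected names are copied from the file structure listed above, and it assumes that only one of the two audio feature files is required (depending on the ASR model you chose).

```python
from pathlib import Path

# Names copied from the file structure listed in this README.
EXPECTED_DIRS = ["ori_imgs", "gt_imgs", "parsing", "torso_imgs"]
EXPECTED_FILES = [
    "aud.wav",
    "bc.jpg",
    "track_params.pt",
    "transforms_train.json",
    "transforms_val.json",
]
# Assumption: one feature file is enough (wav2vec or deepspeech).
AUDIO_FEATURES = ["aud_eo.npy", "aud.npy"]


def missing_outputs(data_dir):
    """Return the expected pre-processing outputs that are absent in data_dir."""
    root = Path(data_dir)
    missing = []
    # The original video is named after the directory: data/<ID>/<ID>.mp4
    video = root.name + ".mp4"
    if not (root / video).is_file():
        missing.append(video)
    missing += [d + "/" for d in EXPECTED_DIRS if not (root / d).is_dir()]
    missing += [f for f in EXPECTED_FILES if not (root / f).is_file()]
    if not any((root / f).is_file() for f in AUDIO_FEATURES):
        missing.append(" or ".join(AUDIO_FEATURES))
    return missing
```

Calling `missing_outputs("data/obama")` returns an empty list when nothing is missing; otherwise the returned names hint at which step of `data_utils/process.py` to re-run.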
Usage
Quick Start
We provide some pretrained models here for quick testing on arbitrary audio.
- Download a pretrained model. For example, we download obama_eo.pth to ./pretrained/obama_eo.pth
- Download a pose sequence file. For example, we download obama.json to ./data/obama.json
- Prepare your audio as <name>.wav, and extract audio features.
# if model is `<ID>_eo.pth`, it uses wav2vec features
python nerf/asr.py --wav data/<name>.wav --save_feats # save to data/<name>_eo.npy
# if model is `<ID>.pth`, it uses deepspeech features
python data_utils/deepspeech_features/extract_ds_features.py --input data/<name>.wav # save to data/<name>.npy
You can download pre-processed audio features too. For example, we download intro_eo.npy to ./data/intro_eo.npy.
- Run inference: It takes about 2GB GPU memory to run inference at 40FPS (measured on a V100).
# save video to trial_obama/results/*.mp4
# if model is `<ID>.pth`, should append `--asr_model deepspeech` and use `--aud intro.npy` instead.
python test.py --pose data/obama.json --ckpt pretrained/obama_eo.pth --aud data/intro_eo.npy --workspace trial_obama/ -O --torso
# provide a background image (default is white)
python test.py --pose data/obama.json --ckpt pretrained/obama_eo.pth --aud data/intro_eo.npy --workspace trial_obama/ -O --torso --bg_img data/bg.jpg
# test with GUI
python test.py --pose data/obama.json --ckpt pretrained/obama_eo.pth --aud data/intro_eo.npy --workspace trial_obama/ -O --torso --bg_img data/bg.jpg --gui
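The checkpoint name decides which audio features and flags are needed: `<ID>_eo.pth` pairs with wav2vec features (`*_eo.npy`), while `<ID>.pth` pairs with deepspeech features and `--asr_model deepspeech`. The sketch below encodes that rule in a small helper that assembles the `test.py` command. The function name `build_test_command` and its parameters are hypothetical; only the flags themselves come from the commands above.

```python
from pathlib import Path


def build_test_command(ckpt, pose, audio_stem, workspace, bg_img=None, gui=False):
    """Assemble a test.py command line using the flags shown in this README.

    audio_stem is the feature path without its suffix, e.g. "data/intro".
    """
    # `<ID>_eo.pth` means wav2vec features; plain `<ID>.pth` means deepspeech.
    uses_wav2vec = Path(ckpt).stem.endswith("_eo")
    aud = f"{audio_stem}_eo.npy" if uses_wav2vec else f"{audio_stem}.npy"
    cmd = [
        "python", "test.py",
        "--pose", pose,
        "--ckpt", ckpt,
        "--aud", aud,
        "--workspace", workspace,
        "-O", "--torso",
    ]
    if not uses_wav2vec:
        cmd += ["--asr_model", "deepspeech"]
    if bg_img is not None:
        cmd += ["--bg_img", bg_img]
    if gui:
        cmd.append("--gui")
    return cmd


if __name__ == "__main__":
    print(" ".join(build_test_command(
        "pretrained/obama_eo.pth", "data/obama.json", "data/intro", "trial_obama/")))
```

The returned list can be passed to `subprocess.run` directly, which avoids shell quoting issues with paths that contain spaces.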
Detailed Usage
The first run will take some time, because the CUDA extensions are compiled.
# train (head)
# by default, we load data from disk on the fly.
# we can also preload all data to CPU/GPU for faster training, but this is very memory-hungry for large datasets.
# `--preload 0`: load from disk (default, slower).
# `--preload 1`: load to CPU, requires ~70G CPU memory (slightly slower)
# `--preload 2`: load to GPU, requires ~24G GPU memory (fast)
python main.py data/obama/ --workspace trial_obama/ -O --iters 200000
# train (finetune lips for another 50000 steps, run after the above command!)
python main.py data/obama/ --workspace trial_obama/ -O --iters 250000 --finetune_lips
# train (torso)
# <head>.pth should be the latest checkpoint in trial_obama
python main.py data/obama/ --workspace trial_obama_torso/ -O --torso --head_ckpt <head>.pth --iters 200000
# test on the test split
python main.py data/obama/ --workspace trial_obama/ -O --test # use head checkpoint, will load GT torso
python main.py data/obama/ --workspace trial_obama_torso/ -O --torso --test
# test with GUI
python main.py data/obama/ --workspace trial_obama_torso/ -O --torso --test --gui
# test with GUI (load speech recognition model for real-time application)
python main.py data/obama/ --workspace trial_obama_torso/ -O --torso --test --gui --asr
# test with specific audio & pose sequence
# --test_train: use train split for testing
# --data_range: use this range's pose & eye sequence (if shorter than audio, automatically mirror and repeat)
python main.py data/obama/ --workspace trial_obama_torso/ -O --torso --test --test_train --data_range 0 100 --aud data/intro_eo.npy
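The `--data_range` comment above says that a pose and eye sequence shorter than the audio is mirrored and repeated. One plausible reading of that behavior is a ping-pong index: play the sequence forward, then backward, and so on, so that the head motion never jumps between the last and first frame. The sketch below illustrates this idea; the function name `mirror_index` is used here for illustration, and the repository's actual indexing may differ in detail.

```python
def mirror_index(index, size):
    """Map an unbounded frame index into [0, size) by ping-ponging.

    Even passes run forward (0 .. size-1), odd passes run backward
    (size-1 .. 0), so consecutive frames are always neighbours or equal.
    """
    turn, res = divmod(index, size)
    return res if turn % 2 == 0 else size - res - 1


if __name__ == "__main__":
    # A 3-frame pose sequence driven by 8 audio frames.
    print([mirror_index(i, 3) for i in range(8)])  # [0, 1, 2, 2, 1, 0, 0, 1]
```

With `--data_range 0 100`, `size` would be 100, and every audio frame beyond the 100th reuses a pose from the mirrored sequence.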
Check the scripts directory for more provided examples.
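For the torso training step, `<head>.pth` should be the latest checkpoint in the head workspace. A small helper can pick it automatically. This is a sketch under stated assumptions: it treats any `.pth` file found anywhere under the workspace as a checkpoint and takes "latest" to mean the most recent modification time; the workspace layout is not described here, so adjust the search pattern to match what you see on disk. Both function names are hypothetical.

```python
from pathlib import Path


def latest_checkpoint(workspace):
    """Return the most recently modified .pth file under workspace."""
    # Assumption: checkpoints are .pth files somewhere below the workspace.
    ckpts = list(Path(workspace).rglob("*.pth"))
    if not ckpts:
        raise FileNotFoundError(f"no .pth checkpoint found under {workspace}")
    return max(ckpts, key=lambda p: p.stat().st_mtime)


def torso_train_command(data_dir, head_workspace, torso_workspace, iters=200000):
    """Assemble the torso training command shown in this README."""
    head_ckpt = latest_checkpoint(head_workspace)
    return [
        "python", "main.py", data_dir,
        "--workspace", torso_workspace,
        "-O", "--torso",
        "--head_ckpt", str(head_ckpt),
        "--iters", str(iters),
    ]
```

For example, `torso_train_command("data/obama/", "trial_obama/", "trial_obama_torso/")` reproduces the torso command above with `<head>.pth` filled in.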
Acknowledgement
- The data pre-processing part is adapted from AD-NeRF.
- The NeRF framework is based on torch-ngp.
- The GUI is developed with DearPyGui.
Citation
@article{tang2022radnerf,
title={Real-time Neural Radiance Talking Portrait Synthesis via Audio-spatial Decomposition},
author={Tang, Jiaxiang and Wang, Kaisiyuan and Zhou, Hang and Chen, Xiaokang and He, Dongliang and Hu, Tianshu and Liu, Jingtuo and Zeng, Gang and Wang, Jingdong},
journal={arXiv preprint arXiv:2211.12368},
year={2022}
}