StyleSDF: High-Resolution 3D-Consistent Image and Geometry Generation
Project Page | Paper | HuggingFace Demo
Roy Or-El1 ,
Xuan Luo1,
Mengyi Shan1,
Eli Shechtman2,
Jeong Joon Park3,
Ira Kemelmacher-Shlizerman1
1University of Washington, 2Adobe Research, 3Stanford University
Updates
12/26/2022: A new HuggingFace demo is now available. Special thanks to @SerdarHelli for the implementation.
3/27/2022: Fixed a bug in the sphere initialization code (init_forward function was missing, see commit 0bd8741).
3/22/2022: Added training files.
3/9/2022: Fixed a bug in the calculation of the mean w vector (see commit d4dd17d).
3/4/2022: Testing code and Colab demo were released. Training files will be released soon.
Overview
StyleSDF is a 3D-aware GAN, aimed at solving two main challenges:
- High-resolution, view-consistent generation of the RGB images.
- Generating detailed 3D shapes.
StyleSDF is trained only on single-view RGB data. The 3D geometry is learned implicitly with an SDF-based volume renderer.
This code is the official PyTorch implementation of the paper:
StyleSDF: High-Resolution 3D-Consistent Image and Geometry Generation
Roy Or-El, Xuan Luo, Mengyi Shan, Eli Shechtman, Jeong Joon Park, Ira Kemelmacher-Shlizerman
CVPR 2022
https://arxiv.org/pdf/2112.11427.pdf
Abstract
We introduce a high resolution, 3D-consistent image and shape generation technique which we call StyleSDF. Our method is trained on single-view RGB data only, and stands on the shoulders of StyleGAN2 for image generation, while solving two main challenges in 3D-aware GANs: 1) high-resolution, view-consistent generation of the RGB images, and 2) detailed 3D shape. We achieve this by merging a SDF-based 3D representation with a style-based 2D generator. Our 3D implicit network renders low-resolution feature maps, from which the style-based network generates view-consistent, 1024Γ1024 images. Notably, our SDFbased 3D modeling defines detailed 3D surfaces, leading to consistent volume rendering. Our method shows higher quality results compared to state of the art in terms of visual and geometric quality.
Pre-Requisits
You must have a GPU with CUDA support in order to run the code.
This code requires PyTorch, PyTorch3D and torchvision to be installed, please go to PyTorch.org and PyTorch3d.org for installation info.
We tested our code on Python 3.8.5, PyTorch 1.9.0, PyTorch3D 0.6.1 and torchvision 0.10.0.
The following packages should also be installed:
- lmdb
- numpy
- ninja
- pillow
- requests
- tqdm
- scipy
- skimage
- skvideo
- trimesh[easy]
- configargparse
- munch
- wandb (optional)
If any of these packages are not installed on your computer, you can install them using the supplied requirements.txt
file:
pip install -r requirements.txt
Download Pre-trained Models
The pre-trained models can be downloaded by running python download_models.py
.
Quick Demo
You can explore our method in Google Colab .
Alternatively, you can download the pretrained models by running:
python download_models.py
To generate human faces from the model pre-trained on FFHQ, run:
python generate_shapes_and_images.py --expname ffhq1024x1024 --size 1024 --identities NUMBER_OF_FACES
To generate animal faces from the model pre-trained on AFHQ, run:
python generate_shapes_and_images.py --expname afhq512x512 --size 512 --identities NUMBER_OF_FACES
Generating images and meshes
To generate images and meshes from a trained model, run:
python generate_shapes_and_images.py --expname NAME_OF_TRAINED_MODEL --size MODEL_OUTPUT_SIZE --identities NUMBER_OF_FACES
The script will generate an RGB image, a mesh generated from depth map, and the mesh extracted with Marching cubes.
Optional flags for image and mesh generation
--no_surface_renderings When true, only RGB outputs will be generated. otherwise, both RGB and depth videos/renderings will be generated. (default: false)
--fixed_camera_angles When true, the generator will render indentities from a fixed set of camera angles. (default: false)
Generating Videos
To generate videos from a trained model, run:
python render_video.py --expname NAME_OF_TRAINED_MODEL --size MODEL_OUTPUT_SIZE --identities NUMBER_OF_FACES
.
This script will generate RGB video as well as depth map video for each identity. The average processing time per video is ~5-10 minutes on an RTX2080 Ti GPU.
Optional flags for video rendering
--no_surface_videos When true, only RGB video will be generated when running render_video.py. otherwise, both RGB and depth videos will be generated. this cuts the processing time per video. (default: false)
--azim_video When true, the camera trajectory will travel along the azimuth direction. Otherwise, the camera will travel along an ellipsoid trajectory. (default: ellipsoid)
--project_noise When true, use geometry-aware noise projection to reduce flickering effects (see supplementary section C.1 in the paper). Warning: processing time significantly increases with this flag to ~20 minutes per video. (default: false)
Training
Preparing your Dataset
If you wish to train a model from scratch, first you need to convert your dataset to an lmdb format. Run:
python prepare_data.py --out_path OUTPUT_LMDB_PATH --n_worker N_WORKER --size SIZE1,SIZE2,SIZE3,... INPUT_DATASET_PATH
Training the volume renderer
Training scripts
To train the volume renderer on FFHQ run: bash ./scripts/train_ffhq_vol_renderer.sh
.
To train the volume renderer on AFHQ run: bash ./scripts/train_afhq_vol_renderer.sh
.
- The scripts above use distributed training. To train the models on a single GPU (not recommended) remove
-m torch.distributed.launch --nproc_per_node NUM_GPUS
from the script.
Training on a new dataset
To train the volume renderer on a new dataset, run:
python train_volume_renderer.py --batch BATCH_SIZE --chunk CHUNK_SIZE --expname EXPERIMENT_NAME --dataset_path DATASET_PATH
Ideally, CHUNK_SIZE
should be the same as BATCH_SIZE
, but on most GPUs it will likely cause an out of memory error. In such case, reduce CHUNK_SIZE
to perform gradient accumulation.
Important note: The best way to monitor the SDF convergence is to look at "Beta value" graph on wandb. Convergence is successful once beta reaches values below approx. 3*10-3. If the SDF is not converging, increase the R1 regularization weight. Another helpful option (to a lesser degree) is to decrease the weight of the minimal surface regularization.
Distributed training
If you have multiple GPUs you can train your model on multiple instances by running:
python -m torch.distributed.launch --nproc_per_node NUM_GPUS train_volume_renderer.py --batch BATCH_SIZE --chunk CHUNK_SIZE --expname EXPERIMENT_NAME --dataset_path DATASET_PATH
Training the full pipeline
Training scripts
To train the volume renderer on FFHQ run: bash ./scripts/train_ffhq_full_pipeline_1024x1024.sh
.
To train the volume renderer on AFHQ run: bash ./scripts/train_afhq_full_pipeline_512x512.sh
.
- The scripts above assume that the volume renderer model was already trained. Do not run them from scratch.
- The scripts above use distributed training. To train the models on a single GPU (not recommended) remove
-m torch.distributed.launch --nproc_per_node NUM_GPUS
from the script.
Training on a new dataset
To train the full pipeline on a new dataset, first train the volume renderer separately.
After the volume renderer training is finished, run:
python train_full_pipeline.py --batch BATCH_SIZE --chunk CHUNK_SIZE --expname EXPERIMENT_NAME --size OUTPUT_SIZE
Ideally, CHUNK_SIZE
should be the same as BATCH_SIZE
, but on most GPUs it will likely cause an out of memory error. In such case, reduce CHUNK_SIZE
to perform gradient accumulation.
Distributed training
If you have multiple GPUs you can train your model on multiple instances by running:
python -m torch.distributed.launch --nproc_per_node NUM_GPUS train_full_pipeline.py --batch BATCH_SIZE --chunk CHUNK_SIZE --expname EXPERIMENT_NAME --size OUTPUT_SIZE
Here, BATCH_SIZE represents the batch per GPU, not the overall batch size.
Optional training flags
Training regime options:
--iter Total number of training iterations. (default: 300,000)
--wandb Use use weights and biases logging. (default: False)
--r1 Weight of the r1 regularization. (default: 10.0)
--view_lambda Weight of the viewpoint regularization. (Equation 6, default: 15)
--eikonal_lambda Weight of the eikonal regularization. (Equation 7, default: 0.1)
--min_surf_lambda Weight of the minimal surface regularization. (Equation 8, default: 0.05)
Camera options:
--uniform When true, the camera position is sampled from uniform distribution. (default: gaussian)
--azim Camera azimuth angle std (guassian)/range (uniform) in Radians. (default: 0.3 Rad.)
--elev Camera elevation angle std (guassian)/range (uniform) in Radians. (default: 0.15 Rad.)
--fov Camera field of view half angle in **Degrees**. (default: 6 Deg.)
--dist_radius Radius of points sampling distance from the origin. Determines the near and far fields. (default: 0.12)
Citation
If you use this code for your research, please cite our paper.
@InProceedings{orel2022stylesdf,
title={Style{SDF}: {H}igh-{R}esolution {3D}-{C}onsistent {I}mage and {G}eometry {G}eneration},
author = {Or-El, Roy and
Luo, Xuan and
Shan, Mengyi and
Shechtman, Eli and
Park, Jeong Joon and
Kemelmacher-Shlizerman, Ira},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2022},
pages = {13503-13513}
}
Acknowledgments
This code is inspired by rosinality's StyleGAN2-PyTorch and Yen-Chen Lin's NeRF-pytorch.
A special thanks to @SerdarHelli for implementing the HuggingFace demo.