
Official repository for the ICCV 2023 paper:

Fantasia3D: Disentangling Geometry and Appearance for High-quality Text-to-3D Content Creation

Rui Chen*, Yongwei Chen*, Ningxin Jiao, Kui Jia

*equal contribution

Paper | ArXiv | Project Page | Supp_material | Video

FAQs

Q1: Why use normal and mask images as the input to the Stable Diffusion model, and what is the analysis behind this choice?

Answer: Our initial hypothesis is that normal and mask images, which respectively capture the local and silhouette information of shapes, can benefit geometry learning. Additionally, the value range of the normal map is normalized to (-1, 1), which aligns with the data range required by latent-space diffusion. Our empirical studies validate this hypothesis. Further support comes from the presence of normal images in the LAION-5B dataset used to train Stable Diffusion (see Website for retrieval of normal data in LAION-5B), so normal data is not an out-of-distribution (OOD) input for Stable Diffusion.

To handle the rough and coarse geometry in the early stage of learning, we directly use the concatenated 64 $\times$ 64 $\times$ 4 (normal, mask) images as the latent code, inspired by Latent-NeRF, to achieve better convergence. However, using the normal map in the world coordinate system without VAE encoding may be inconsistent with the data distribution of the latent space trained by the VAE, and this mismatch can cause the generated geometry to deviate from the text description in some cases. To address this issue, we apply a data augmentation that randomly rotates the normal map rendered from the current view, which brings the distribution of the normal map closer to the distribution of latent-space data; we experimentally observe that it improves the alignment between the generated geometry and the text description.

As learning progresses, it becomes essential to render the 512 $\times$ 512 $\times$ 3 high-resolution normal image to capture finer geometry details, so we use the normal image only in the later stage. This strategy strikes an accuracy-efficiency balance throughout the geometry optimization process.

Q2: Hypothesis-verification analysis of the disentangled representation

Answer: Previous methods (e.g., DreamFusion and Magic3D) couple geometry and appearance generation together, following NeRF. Our adoption of a disentangled representation is mainly motivated by the different natures of the problems of generating surface geometry and generating appearance. In fact, for finer recovery of surface geometry from multi-view images, methods that explicitly take surface modeling into account (e.g., VolSDF, nvdiffrec) perform better; our disentangled representation enjoys a similar benefit. The disentangled representation also enables us to include a BRDF material representation in the appearance modeling, achieving more photo-realistic rendering through the BRDF physical prior.

Q3: Can Fantasia3D directly fine-tune the mesh given by the user?

Answer: Yes, it can. Fantasia3D can take any mesh provided by the user and fine-tune it with our user-guided generation method. It can also naturally interface with 3D generative methods such as Shap-E and Point-E. In short, Fantasia3D can generate highly detailed and high-fidelity 3D content starting from either a low-quality mesh provided by the user or the default ellipsoid.

What do you want?

Considering that parameter tuning may require some experience, what kind of object would you like me to generate? Please feel free to post your requests in the issue area. I will take some time to implement some of them and upload the corresponding configuration files to make reproduction easier for you.

Contribute to Fantasia3D

First, upload the videos converted from GIFs using the Website, showing the geometry or appearance, to the Gallery. Write down the text used to generate the object, the performance, the resolution of the tetrahedral grid used for geometry modeling, and the strategy adopted for appearance modeling.

Next, upload the configuration file under the configs directory. If you upload a file for user-guided generation, the guided mesh should also be uploaded under the data directory. The naming rules for the files are as follows.

For the file of zero-shot geometry modeling:

{The key word of the text}_geometry_zero_shot_{the number of gpu}_gpu.json

For the file of user-guided geometry modeling:

{The key word of the text}_geometry_user_guided_{the number of gpu}_gpu.json

For the file of appearance modeling:

{The key word of the text}_appearance_strategy{the strategy adopted}_{the number of gpu}_gpu.json
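
For example, hypothetical contributed files might be named as follows (the keyword, strategy, and GPU count below are placeholders):

car_geometry_zero_shot_8_gpu.json
Gundam_geometry_user_guided_4_gpu.json
car_appearance_strategy0_8_gpu.json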

Install

  • System requirement: Ubuntu 20.04
  • Tested GPUs: RTX 3090, RTX 4090

We provide two choices to install the environment.

  • (Option 1) Use the file requirements.txt to install all packages one by one. This may fail due to the complexity of some packages.

    pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 -f https://download.pytorch.org/whl/torch_stable.html
    pip install -r requirements.txt
  • (Option 2) Use the Docker image to quickly deploy the environment on Ubuntu.

    docker pull registry.cn-guangzhou.aliyuncs.com/baopin/fantasia3d:1.0

    Due to network delays, the xformers package is not included in this Docker image. Install it manually after you create a container from the image.
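
    For reference, a container can be created from the pulled image roughly as follows (a minimal sketch: the container name is arbitrary, and --gpus all assumes the NVIDIA Container Toolkit is installed), and xformers can then be installed inside it with the command below.

    docker run --gpus all -it --name fantasia3d registry.cn-guangzhou.aliyuncs.com/baopin/fantasia3d:1.0 /bin/bash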

    pip install git+https://github.com/facebookresearch/xformers.git@main#egg=xformers

After the environment has been set up successfully, clone the Fantasia3D repository and get started.

git clone https://github.com/Gorilla-Lab-SCUT/Fantasia3D.git
cd Fantasia3D

Start

All the results in the paper were generated using eight RTX 3090 GPUs. We cannot guarantee that fewer GPUs will achieve the same results.

  • zero-shot generation
# Multi-GPU training
...
# Geometry modeling using 8 GPUs
python3 -m torch.distributed.launch --nproc_per_node=8 train.py --config configs/car_geometry.json
# Geometry modeling using 4 GPUs
python3 -m torch.distributed.launch --nproc_per_node=4 train.py --config configs/car_geometry.json
# Appearance modeling using 8 GPUs
python3 -m torch.distributed.launch --nproc_per_node=8 train.py --config configs/car_appearance_strategy0.json
# Appearance modeling using 4 GPUs
python3 -m torch.distributed.launch --nproc_per_node=4 train.py --config configs/car_appearance_strategy0.json
...
# Single-GPU training (only tested on the pineapple).
# Geometry modeling. It takes about 15 minutes on a 3090 GPU.
python3 train.py --config configs/pineapple_geometry_single_gpu.json
# Appearance modeling. It takes about 15 minutes on a 3090 GPU.
python3 train.py --config configs/pineapple_appearance_strategy0_single_gpu.json
  • user-guided generation
# Multi-GPU training
...
# Geometry modeling using 8 GPUs
python3 -m torch.distributed.launch --nproc_per_node=8 train.py --config configs/Gundam_geometry.json
# Geometry modeling using 4 GPUs
python3 -m torch.distributed.launch --nproc_per_node=4 train.py --config configs/Gundam_geometry.json
# Appearance modeling using 8 GPUs
python3 -m torch.distributed.launch --nproc_per_node=8 train.py --config configs/Gundam_appearance.json
# Appearance modeling using 4 GPUs
python3 -m torch.distributed.launch --nproc_per_node=4 train.py --config configs/Gundam_appearance.json
...
# Single-GPU training
# Geometry modeling
python3 train.py --config configs/Gundam_geometry.json
# Appearance modeling
python3 train.py --config configs/Gundam_appearance.json

Tips

  • (both) Train longer. Training longer may help with finer details. You can train longer by increasing the parameter "iter".

  • (both) Use a larger batch size. A larger batch size can help with faster convergence. The corresponding parameter is "batch".
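
    For example, both parameters can be set in the configuration file (the values below are purely illustrative, not recommended settings):

    "iter": 5000,
    "batch": 8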

  • (both) Try different seeds. Different seeds can bring diverse results.

  • (both) Scale the object. Increasing the proportion of the screen (FOV = 45°) occupied by the initialized object can improve the quality of both geometry and appearance modeling. For geometry modeling, it yields more local geometric detail. For appearance modeling, it reduces the probability of saturated or strange colors appearing, because it reduces the proportion of background color in the image; we found that if the proportion of background color is too high, it easily leads to saturation and strange colors.

  • (geometry modeling) Provide a proportional prior of the target shape. You can scale the default sphere with a radius of 1 to an ellipsoid. For instance, make the radius of the ellipsoid on the z-axis larger if you want to generate "A car made out of cheese".

    "mode": "geometry_modeling",
    "sdf_init_shape": "ellipsoid",
    "sdf_init_shape_scale": [0.56, 0.56, 0.84]

    There are situations where an ellipsoid cannot provide a proportional prior, such as when generating an animal. In this case, ellipsoid initialization can easily cause the generated animal to have multiple feet. Run the following command to see such a case:

    python3 -m torch.distributed.launch --nproc_per_node=8 train.py --config configs/elephant_geometry_fail_multi_face.json 

    Instead, you can use the sketch shape of a quadruped as a proportional prior to generate any animal shape you want.

    python3 -m torch.distributed.launch --nproc_per_node=8 train.py --config configs/elephant_geometry_succeed.json

    In other situations, such as generating a human-like body, a human sketch shape can be used.

    python3 -m torch.distributed.launch --nproc_per_node=8 train.py --config configs/Gundam_geometry.json
  • (geometry modeling) Increase the number of iterations in the early phase. The early phase is crucial for creating a coarse and correct shape; the late phase only focuses on attaining finer geometric details, so it causes no significant changes in the overall shape. Increase the value of the parameter "coarse_iter" if you find that the contour of the generated shape does not match the text description.

  • (geometry modeling) Use a larger tetrahedral-grid resolution. A larger resolution can bring more detail to the local geometry. You can change the resolution by modifying the value of the parameter "dmtet_grid" to 128 or 256. Note that if the mesh quickly disappears or disperses at a resolution of 256, decrease the guidance weight of the SDS loss from the default 100 to 50. In my experience, a single GPU is better suited to a resolution of 128 than 256; if you want a high-detail model at a resolution of 256, multi-GPU training is necessary. In addition, multiple GPUs perform much better than a single GPU for objects with an obvious orientation, such as human head statues. Using the gradient accumulation technique on a single GPU may achieve the effect of multiple GPUs, but I have not tested it yet.
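
    For example, the grid resolution is set in the configuration file like this (the value is shown only as an illustration):

    "dmtet_grid": 256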

  • (geometry modeling) Use a different time-step range in the early phase. We usually use the time-step range [0.02, 0.5] in the early phase. But in some cases where you want to "grow" more parts based on the initialized shape, this range may fail to generate all the parts. For instance, with the text "An astronaut riding a horse", the astronaut part may fail to "grow" with the range [0.02, 0.5], since low time steps contribute little to significant deformation. To address this problem, we recommend a higher range, such as [0.4, 0.6]. You can try different ranges and share your findings in the issues.
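
    Assuming the geometry configs use the same "early_time_step_range" parameter that appears in the appearance snippets below, a setting for the astronaut example might look like this:

    "early_time_step_range": [0.4, 0.6]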

  • (geometry modeling) Rotate the object. Rotating the initialized object according to the actual situation can alleviate the Janus problem and help the network with mode-seeking. For example, when generating a human head statue, rotate the initialized ellipsoid around the x-axis by some angle to match the fact that the back of a person's head has some curvature.

  • (geometry modeling) Fine-tune the input mesh. In the user-guided generation task, if you are satisfied with the silhouette of the input mesh and only want to add geometric detail, set the parameter "coarse_iter" to 400. This setting skips directly to the late phase of geometry modeling, which reinforces the local geometric details of the input shape.
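
    A hypothetical user-guided config fragment for this case (all other fields omitted) might look like:

    "mode": "geometry_modeling",
    "coarse_iter": 400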

  • (appearance modeling) Use different strategies. We offer three strategies (0, 1, or 2) for optimizing the appearance, selected with the parameter "sds_weight_strategy". With strategy 0, there are stronger light and shadow changes, giving a more realistic final appearance. With strategy 1 or 2, the final appearance is smoother and more comfortable to look at. If the target appearance is simple, such as "a highly detailed stone bust of Theodoros Kolokotronis", "A standing elephant", or "Michelangelo style statue of dog reading news on a cellphone", strategy 0 may lead to an oversaturated appearance and strange colors; in that case, strategy 1 or 2 generates more natural colors than strategy 0.

    Strategy 0 can be used as follows:

    "sds_weight_strategy": 0,
    "early_time_step_range": [0.02, 0.98],
    "late_time_step_range": [0.02, 0.5]

    or

    "sds_weight_strategy": 0,
    "early_time_step_range": [0.02, 0.98],
    "late_time_step_range": [0.02, 0.98]

    Strategy 1 can be used as follows:

    "sds_weight_strategy": 1,
    "early_time_step_range": [0.02, 0.98],
    "late_time_step_range": [0.02, 0.7]

    or

    "sds_weight_strategy": 1,
    "early_time_step_range": [0.02, 0.98],
    "late_time_step_range": [0.02, 0.98]

    Strategy 2 can be used as follows:

    "sds_weight_strategy": 2,
    "early_time_step_range": [0.02, 0.98],
    "late_time_step_range": [0.02, 0.98]
  • (appearance modeling) Use different HDR environment maps. Learning PBR materials is an ill-posed problem; learning materials and lighting together makes it even harder, so we use a fixed HDR environment map to optimize the appearance. We noticed that HDR maps with a uniform brightness distribution, such as cloudy-sky maps, help keep the appearance colors uniform. Maps with an uneven brightness distribution may produce more realistic results (untested).

Coordinate System

Demos

You can download and watch the training process of some demos on Google Drive.

baby_bunny_geometry.mp4
bunny_appearance_strategy0.mp4
dog_appearance_stratege0.mp4
dog_geometry.mp4
sandcastle_geometry.mp4
sandcastle_appearacne_strategy0.mp4
sandcastle_appearance_strategy1.mp4
sandcastle_appearance_strategy2.mp4
Gundam_geometry.mp4
Gundam_appearance.mp4
mug_geometry.mp4
mug_appearance.mp4
Einstein.mp4
pineapple_geometry.mp4
pineapple_appearance.mp4
car_geometry.mp4
car_appearance_strategy0.mp4
car_appearance_strategy1.mp4
elephant_geometry.mp4

Todo

  • Release the code. (2023.06.15)
  • Support the gradient accumulation technique for single GPU training.
  • Support the VSD loss proposed by ProlificDreamer.

Acknowledgement

BibTeX

@article{chen2023fantasia3d,
    title={Fantasia3D: Disentangling Geometry and Appearance for High-quality Text-to-3D Content Creation},
    author={Rui Chen and Yongwei Chen and Ningxin Jiao and Kui Jia},
    journal={arXiv preprint arXiv:2303.13873},
    year={2023}
}
