

Scene-Aware 3D Multi-Human Motion Capture

This software is released as part of the supplementary material of the paper:

Scene-Aware 3D Multi-Human Motion Capture from a Single Camera, EUROGRAPHICS 2023
Diogo C. Luvizon | Marc Habermann | Vladislav Golyanik | Adam Kortylewski | Christian Theobalt
Project: https://vcai.mpi-inf.mpg.de/projects/scene-aware-3d-multi-human
Code: https://github.com/dluvizon/scene-aware-3d-multi-human

Abstract

In this work, we consider the problem of estimating the 3D position of multiple humans in a scene as well as their body shape and articulation from a single RGB video recorded with a static camera. In contrast to expensive marker-based or multi-view systems, our lightweight setup is ideal for private users as it enables an affordable 3D motion capture that is easy to install and does not require expert knowledge. To deal with this challenging setting, we leverage recent advances in computer vision using large-scale pre-trained models for a variety of modalities, including 2D body joints, joint angles, normalized disparity maps, and human segmentation masks. Thus, we introduce the first non-linear optimization-based approach that jointly solves for the 3D position of each human, their articulated pose, their individual shapes as well as the scale of the scene. In particular, we estimate the scene depth and person scale from normalized disparity predictions using the 2D body joints and joint angles. Given the per-frame scene depth, we reconstruct a point-cloud of the static scene in 3D space. Finally, given the per-frame 3D estimates of the humans and scene point-cloud, we perform a space-time coherent optimization over the video to ensure temporal, spatial and physical plausibility. We evaluate our method on established multi-person 3D human pose benchmarks where we consistently outperform previous methods and we qualitatively demonstrate that our method is robust to in-the-wild conditions including challenging scenes with people of different sizes.

Overview of the method

1. Installation

1.1 HW/SW Requirements

This software was tested on the following systems:

Operating System: Debian GNU/Linux 10; Ubuntu 20.04.5 LTS
GPU: TITAN V (12 GiB); Quadro RTX 8000 (48 GiB)
CPU-only execution is also supported (very slow)
CPU RAM: 11 GiB
Python 3 and (Mini)Conda

1.2 Minimal setup

A minimal installation is possible by simply creating a new conda environment. This assumes that the input data modalities are pre-computed and available.

1.2.1 Create a conda environment

conda env create -f environment.yml
conda activate multi-human-mocap

Note that some packages in environment.yml are only needed for visualizations.

1.2.2 Download Human Model and Body Joint Regressors

We use the SMPL body model, which can be downloaded from [ here ]. Download the file SMPL_NEUTRAL.pkl and place it in model_data/parameters.
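
As a quick sanity check (an illustrative snippet, not part of the released scripts), you can confirm that the model file ends up where the code expects it:

from pathlib import Path

# Illustrative check: confirm the SMPL model is in the expected location.
smpl_path = Path("model_data/parameters/SMPL_NEUTRAL.pkl")
assert smpl_path.is_file(), f"SMPL model not found at {smpl_path}"
print(f"Found SMPL model ({smpl_path.stat().st_size / 1e6:.1f} MB)")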

1.3 Pre-processing and Image Predictors

<This step is only required for predictions from new videos>

Our method relies on four different predictors as input data modalities. Please follow the optional instructions [ here ] to install and set up each predictor. Alternatively, you can also install the predictors on your own. The predictors we use (one per input modality, matching the folder structure in Section 2.2) are:

AlphaPose for 2D body joints
ROMP for SMPL joint angles (articulated pose)
DPT for normalized disparity (monocular depth) maps
Mask2Former for human instance segmentation masks

All the predictors are independent and can be installed and executed in parallel. Note that these predictors are not part of our software distribution, although we provide simplified instructions on how to install and adapt them, if necessary.

2. Evaluation on MuPoTs-3D

2.1 First, download MuPoTs-3D and rearrange the data:

mkdir -p data && cd data
wget -c http://gvv.mpi-inf.mpg.de/projects/SingleShotMultiPerson/MultiPersonTestSet.zip
unzip MultiPersonTestSet.zip
for ts in {1..20}; do
  mkdir -p mupots-3d-eval/TS${ts}/images
  mv MultiPersonTestSet/TS${ts}/* mupots-3d-eval/TS${ts}/images/
done
rm -r MultiPersonTestSet

2.2 For the sequences from MuPoTs-3D, we provide the pre-processed data required to run our software. Please download it from [ here ], place the file in data/, and extract it with:

tar -jxf mhmc_mupots-3d-eval.tar.bz2

After this, the folder data/mupots-3d-eval should have the following structure:

  |-- data/mupots-3d-eval/
      |-- TS1/
          |-- AlphaPose/
              |-- alphapose-results.json
          |-- DPT_large_monodepth/
              |-- img_%06d.png
          |-- images/
              |-- img_%06d.jpg
              |-- annot.mat
              |-- intrinsics.txt
              |-- occlusion.mat
          |-- Mask2Former_Instances/
              |-- img_%06d.png
          |-- ROMP_Predictions/
              |-- img_%06d.npz
      |-- TS2/
          [...]
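
Before running the optimization, the illustrative snippet below (not part of the released code) can verify that every test sequence has the expected subfolders:

from pathlib import Path

# Illustrative check of the per-sequence layout described above.
expected = ["AlphaPose", "DPT_large_monodepth", "images",
            "Mask2Former_Instances", "ROMP_Predictions"]
root = Path("data/mupots-3d-eval")
for ts in range(1, 21):
    seq = root / f"TS{ts}"
    missing = [name for name in expected if not (seq / name).is_dir()]
    if missing:
        print(f"{seq}: missing {missing}")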

2.3 With the minimal installation done and the MuPoTs-3D data ready, run:

./script/predict_mupots_full.sh

This runs the optimization on the full dataset (TS1..TS20) and can take a long time, depending on the hardware configuration. For a quick test of the software, run:

./script/predict_mupots_test.sh # TS1 only, only a few iterations

After running the prediction part for the full dataset, the output will be stored in ./output.

2.4 Compute the scores from the predicted outputs:

./script/eval_mupots.sh
cat output/mupots/FinalResults.md # show our results

3. Predictions from New Videos

Processing new videos requires all the predictors to be installed. If you have not done this yet, please follow Section 1.3 above. For a new video, extract its frames (using ffmpeg or a similar tool) to [path_to_video]/images/img_%06d.jpg, for example as in the sketch below.
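
For example, frames can also be dumped with OpenCV instead of ffmpeg (an illustrative sketch; the input file my_video.mp4 and the output folder data/my_video are hypothetical):

import os
import cv2

# Illustrative frame extraction into the [path_to_video]/images/img_%06d.jpg layout.
path_to_video = "data/my_video"   # hypothetical output folder
os.makedirs(os.path.join(path_to_video, "images"), exist_ok=True)

cap = cv2.VideoCapture("my_video.mp4")  # hypothetical input video
frame_idx = 1
while True:
    ok, frame = cap.read()
    if not ok:
        break
    cv2.imwrite(os.path.join(path_to_video, "images",
                             f"img_{frame_idx:06d}.jpg"), frame)
    frame_idx += 1
cap.release()
print(f"Extracted {frame_idx - 1} frames")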

3.1 Preprocess video frames

Run the script:

# path_to_video="path-to-your-video-file"
./script/preproc_data.sh ${path_to_video}

After this, the folder [path_to_video]/ should contain all the preprocessed outputs (depth maps, 2D poses, SMPL parameters, and segmentation masks).

3.2 Run our code

./script/predict_internet.sh ${path_to_video} ${path_to_video}/output

This script calls mhmocap.predict_internet.py, which assumes a standard camera with a 60° field of view. Modify it if you need to set the camera intrinsics properly.
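
For reference, a pinhole intrinsics matrix can be derived from a horizontal field of view as in the illustrative sketch below (the function name and image resolution are hypothetical; the released script may organize this differently):

import numpy as np

# Illustrative: focal length from a horizontal FOV, assuming square pixels
# and a principal point at the image center.
def intrinsics_from_fov(width, height, fov_deg=60.0):
    fx = (width / 2.0) / np.tan(np.radians(fov_deg) / 2.0)
    fy = fx
    cx, cy = width / 2.0, height / 2.0
    return np.array([[fx, 0.0, cx],
                     [0.0, fy, cy],
                     [0.0, 0.0, 1.0]], dtype=np.float32)

print(intrinsics_from_fov(1920, 1080))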

4. Interactive 3D Visualization

We provide a visualization tool based on Open3D. This module can show our predictions in 3D for a video sequence in an interactive way:

python -m mhmocap.visualization \
  --input_path="${path_to_video}/output" \
  --output_path="${path_to_video}/output"
Type "n" to jump to the next frame and "u" to see it from the camera perspective.

Citation

Please cite our paper if this software (or any part of it) is useful to you.

@article{SceneAware_EG2023,
  title = {{Scene-Aware 3D Multi-Human Motion Capture from a Single Camera}},
  author = {Luvizon, Diogo and Habermann, Marc and Golyanik, Vladislav and Kortylewski, Adam and Theobalt, Christian},
  journal = {Computer Graphics Forum},
  volume = {42},
  number = {2},
  pages = {371-383},
  doi = {https://doi.org/10.1111/cgf.14768},
  year = {2023},
}

License

Please see the License Terms in the LICENSE file.

Acknowledgment

This work was funded by the ERC Consolidator Grant 4DRepLy (770784).

Some parts of this code were borrowed from many other great repositories, including ROMP, VIBE, and more. We also thank Rishabh Dabral for his help with the animated characters in Blender.