Human Pose Estimation Benchmarking and Action Recognition
Deep Learning Project, Winter 2019, Northwestern University
Group members: Chenge Yang, Zhicheng Yu, Feiyu Chen
Results
1. Human Pose Estimation Benchmarking
Multi-Person (left: AlphaPose, right: OpenPose)
Single-Person (left: AlphaPose, right: OpenPose)
2. Action Recognition
Introduction
This project contains two main parts:
1. Human Pose Estimation Benchmarking
In this part, we benchmarked two state-of-the-art human pose estimation models, OpenPose and AlphaPose. We tested different modes of each model in both single-person and multi-person scenarios.
2. Online Skeleton-Based Action Recognition
Real-time multi-person human action recognition based on tf-pose-estimation. The pipeline is as follows:
- Real-time multi-person pose estimation via tf-pose-estimation
- Feature Extraction
- Multi-person action recognition using TensorFlow / Keras
Dependencies and Installation
1. Human Pose Estimation Benchmarking
Check the installation_benchmarking.md.
2. Online Skeleton-Based Action Recognition
Check the installation_action_recognition.md.
Usage
Human Pose Estimation Benchmarking
- Check the command_benchmarking.md for the running commands we used.
- Check the official websites of OpenPose and AlphaPose for more detailed running options.
Training Action Recognition Model
- Copy your dataset (must be a .csv file) into the /data folder
- Run training.py with the following command:
python3 src/training.py --dataset [dataset_filename]
- The trained model is saved in the /model folder
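Before training, the .csv dataset has to be read into feature and label arrays. The sketch below shows one way to do this; the column layout (label first, then flattened joint coordinates) is an assumption for illustration and may differ from the actual files in /data.

```python
import csv

import numpy as np


def load_dataset(path):
    """Load a skeleton dataset from a .csv file.

    Assumed row layout (hypothetical, adjust to the real files):
    [action_label, x0, y0, x1, y1, ..., xN, yN]
    """
    features, labels = [], []
    with open(path, newline="") as f:
        for row in csv.reader(f):
            if not row:
                continue  # skip blank lines
            labels.append(row[0])
            features.append([float(v) for v in row[1:]])
    return np.array(features), np.array(labels)
```

A `train_test_split` on the returned arrays would then feed directly into the Keras model in `training.py`.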
Real-time Action Recognition
- To see our multi-person action recognition result using your webcam, run run_detector.py with the following command:
python3 src/run_detector.py --images_source webcam
Benchmarking Results
Requirements
- OS: Ubuntu 18.04
- CPU: AMD Ryzen Threadripper 1920X (12-core / 24-thread)
- GPU: Nvidia GTX 1080 Ti - 11 GB
- RAM: 64GB
- Webcam: Creative 720p Webcam
1. Multi-person
Benchmarked on a 1920x1080, 30 fps video with 902 frames
2. Single-person
Benchmarked on a 1920x1080, 30 fps video with 902 frames
Implementation (Action Recognition)
Collecting training data
We collected 3916 training images with our laptop's webcam to train the model to classify five actions: squat, stand, punch, kick, and wave. Each training image contains a single person performing one of these five actions. The videos were recorded at 10 fps with a frame size of 640 x 480, and then saved as images.
The examples and the numbers of training images for each action class are shown below:
| squat | stand | punch | kick | wave |
| --- | --- | --- | --- | --- |
Get Skeleton from Image
We used tf-pose-estimation to detect the human pose in each training image. The output skeleton format of OpenPose can be found at OpenPose Demo - Output.
The generated training data files are located in the /data folder:
- skeleton_raw.csv: original data
- skeleton_filtered.csv: filtered data where incomplete poses are eliminated
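The filtering step can be sketched as follows. This is a minimal illustration, not the project's actual code: it assumes each skeleton is a list of 18 (x, y) joint tuples and that tf-pose-estimation reports an undetected joint as (0, 0), both of which are assumptions here.

```python
def filter_incomplete(skeletons, n_joints=18):
    """Keep only skeletons where every joint was detected.

    Assumption (hypothetical): a missing joint is encoded as (0, 0),
    which is a common convention for tf-pose-estimation output.
    """
    complete = []
    for joints in skeletons:
        has_all_joints = len(joints) == n_joints
        no_missing = all(x != 0 or y != 0 for x, y in joints)
        if has_all_joints and no_missing:
            complete.append(joints)
    return complete
```

Dropping incomplete poses this way keeps the feature vectors fed to the network at a fixed length.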
Feature Extraction
To transform the original skeleton data into the input of our neural network, three features are extracted; they are implemented in data_preprocessing.py:
- Head reference: all joint positions are converted to the x-y coordinates relative to the head joint.
- Pose to angle: the 18 joint positions are converted to 8 joint angles: left / right shoulder, left / right elbow, left / right hip, left / right knee.
- Normalization: all joint positions are converted to the x-y coordinates relative to the skeleton bounding box.
We use the third feature, which gives the best accuracy and robustness.
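The bounding-box normalization above can be sketched in a few lines. This is an illustrative version, not the code from data_preprocessing.py; it assumes joints are given as (x, y) tuples in pixel coordinates.

```python
def normalize_to_bbox(joints):
    """Map joint (x, y) positions into the skeleton's bounding box.

    Each coordinate ends up in [0, 1] relative to the box, which makes
    the feature invariant to where the person stands and how large they
    appear in the frame.
    """
    xs = [x for x, _ in joints]
    ys = [y for _, y in joints]
    min_x, min_y = min(xs), min(ys)
    width = (max(xs) - min_x) or 1.0   # avoid division by zero
    height = (max(ys) - min_y) or 1.0
    return [((x - min_x) / width, (y - min_y) / height) for x, y in joints]
```

The head-reference feature would be analogous, subtracting the head joint's coordinates instead of the bounding-box corner.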
Deep Learning model
We built our Deep Learning model referring to Online-Realtime-Action-Recognition-based-on-OpenPose. The model is implemented in training.py using Keras and TensorFlow. It consists of three hidden layers and a Softmax output layer to conduct a 5-class classification.
The generated model is saved in the /model folder.
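The forward pass of such a network can be sketched in plain NumPy. The layer sizes below are illustrative assumptions (the real architecture lives in training.py), and the weights are random placeholders rather than trained parameters.

```python
import numpy as np

rng = np.random.default_rng(0)


def relu(x):
    return np.maximum(0.0, x)


def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))  # shift for stability
    return e / e.sum(axis=-1, keepdims=True)


# Hypothetical layer sizes: 36 inputs (18 joints x 2 normalized coords),
# three hidden layers, 5 output classes (squat/stand/punch/kick/wave).
sizes = [36, 128, 64, 16, 5]
weights = [rng.standard_normal((m, n)) * 0.1 for m, n in zip(sizes, sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]


def forward(x):
    """Three ReLU hidden layers followed by a softmax over the 5 classes."""
    for W, b in zip(weights[:-1], biases[:-1]):
        x = relu(x @ W + b)
    return softmax(x @ weights[-1] + biases[-1])
```

In Keras this corresponds to a `Sequential` stack of three `Dense` layers with ReLU activations and a final `Dense(5, activation="softmax")`, trained with categorical cross-entropy.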