HaGRID - HAnd Gesture Recognition Image Dataset
We introduce a large image dataset, HaGRID (HAnd Gesture Recognition Image Dataset), for hand gesture recognition (HGR) systems. You can use it for image classification or image detection tasks. The proposed dataset allows building HGR systems that can be used in video conferencing services (Zoom, Skype, Discord, Jazz, etc.), home automation systems, the automotive sector, and more.
HaGRID is 716 GB in size and contains 552,992 FullHD (1920 × 1080) RGB images divided into 18 classes of gestures. In addition, some images contain a no_gesture class for a second, free hand in the frame; this extra class contains 123,589 samples. The data were split into training (92%) and testing (8%) sets by subject user_id, with 509,323 images for train and 43,669 images for test.
The dataset contains 34,730 unique persons and at least as many unique scenes. The subjects are people from 18 to 65 years old. The dataset was collected mainly indoors with considerable variation in lighting, including artificial and natural light. It also includes images taken in extreme conditions, such as facing a window or with the window behind the subject. The subjects showed gestures at a distance of 0.5 to 4 meters from the camera.
An example of a sample and its annotation:
For more information, see our arXiv paper HaGRID - HAnd Gesture Recognition Image Dataset.
Installation
Clone the repository and install the required Python packages:
git clone https://github.com/hukenovs/hagrid.git
# or mirror link:
cd hagrid
# Create virtual env by conda or venv
conda create -n gestures python=3.9 -y
conda activate gestures
# Install requirements
pip install -r requirements.txt
Docker Installation
docker build -t gestures .
docker run -it -d -v $PWD:/gesture-classifier gestures
Downloads
Because of the large size of the data, we split the train dataset into 18 archives by gesture. Download and unzip them from the following links:
Trainval
Gesture | Size | Gesture | Size |
---|---|---|---|
call | 39.1 GB | peace | 38.6 GB |
dislike | 38.7 GB | peace_inverted | 38.6 GB |
fist | 38.0 GB | rock | 38.9 GB |
four | 40.5 GB | stop | 38.3 GB |
like | 38.3 GB | stop_inverted | 40.2 GB |
mute | 39.5 GB | three | 39.4 GB |
ok | 39.0 GB | three2 | 38.5 GB |
one | 39.9 GB | two_up | 41.2 GB |
palm | 39.3 GB | two_up_inverted | 39.2 GB |
train_val
annotations: ann_train_val
Test
Test | Archives | Size |
---|---|---|
images | test | 60.4 GB |
annotations | ann_test | 27.3 MB |
Subsample
The subsample has 100 items per gesture.
Subsample | Archives | Size |
---|---|---|
images | subsample | 2.5 GB |
annotations | ann_subsample | 1.2 MB |
HaGRID 512px - a lightweight version of the full dataset with max_side = 512px
or by using the Python script:
python download.py --save_path <PATH_TO_SAVE> \
--train \
--test \
--subset \
--annotations \
--dataset
Run the script with the --subset key to download the small subset (100 images per class). You can download the train subset with --train or the test subset with --test. Download annotations for the selected stage with the --annotations key, and download the dataset images with --dataset.
usage: download.py [-h] [--train] [--test] [--subset] [-a] [-d] [-t TARGETS [TARGETS ...]] [-p SAVE_PATH]
Download dataset...
optional arguments:
-h, --help show this help message and exit
--train Download trainval set
--test Download test set
--subset Download subset with 100 items of each gesture
-a, --annotations Download annotations
-d, --dataset Download dataset
-t TARGETS [TARGETS ...], --targets TARGETS [TARGETS ...]
Target(s) for downloading train set
-p SAVE_PATH, --save_path SAVE_PATH
Save path
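For example, to download only the train archives for a couple of gestures together with their annotations (the save path below is a placeholder, and the target names are assumed to match the gesture class names above):
python download.py --save_path ./hagrid_data --train --dataset --annotations -t call like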
Models
We provide some pre-trained models as baselines with classic backbone architectures and two output heads: one for gesture classification and one for leading hand classification.
Classifiers | F1 Gestures | F1 Leading hand |
---|---|---|
ResNet18 | 98.80 | 98.80 |
ResNet152 | 99.04 | 98.92 |
ResNeXt50 | 98.95 | 98.87 |
ResNeXt101 | 99.16 | 98.71 |
MobileNetV3_small | 96.50 | 97.31 |
MobileNetV3_large | 98.03 | 97.99 |
Vitb32 | 98.35 | 98.63 |
Lenet | 84.58 | 91.16 |
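The two-head design can be sketched as a shared backbone feeding two independent linear heads. The snippet below is only an illustrative sketch (the ResNet18 trunk and 19 gesture classes, i.e. 18 gestures plus no_gesture, are assumptions), not the exact code behind the released checkpoints:

import torch.nn as nn
from torchvision import models


class TwoHeadClassifier(nn.Module):
    """Illustrative two-head classifier: shared backbone, gesture + leading-hand heads."""

    def __init__(self, num_gestures: int = 19, num_hands: int = 2):
        super().__init__()
        backbone = models.resnet18(weights=None)   # any classic backbone works here
        feat_dim = backbone.fc.in_features
        backbone.fc = nn.Identity()                # keep only the feature extractor
        self.backbone = backbone
        self.gesture_head = nn.Linear(feat_dim, num_gestures)
        self.hand_head = nn.Linear(feat_dim, num_hands)

    def forward(self, x):
        features = self.backbone(x)
        return {
            "gesture": self.gesture_head(features),
            "leading_hand": self.hand_head(features),
        }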
We also provide some models for the hand detection problem.
Detector | mAP |
---|---|
SSDLiteMobileNetV3Large | 71.49 |
SSDLiteMobileNetV3Small | 53.38 |
FRCNNMobilenetV3LargeFPN | 78.05 |
YoloV7Tiny | 81.1 |
However, if you need a single gesture, you can use pre-trained full frame classifiers instead of detectors. To use the full frame models, set the configuration parameter full_frame: True and remove the no_gesture class.
Full Frame Classifiers | F1 Gestures |
---|---|
ResNet18 | 93.51 |
ResNet152 | 94.49 |
ResNeXt50 | 95.20 |
ResNeXt101 | 95.67 |
MobileNetV3_small | 87.09 |
MobileNetV3_large | 90.96 |
Train
You can use the downloaded pre-trained models; otherwise, select a classifier and training parameters in default.yaml.
To train the model, execute the following command:
python -m classifier.run --command 'train' --path_to_config <PATH>
python -m detector.run --command 'train' --path_to_config <PATH>
At every step, the current loss, learning rate, and other values are logged to TensorBoard.
View all saved metrics and parameters by running the following command (it serves a web page at localhost:6006):
tensorboard --logdir=experiments
Test
Test your model by running the following command:
python -m classifier.run --command 'test' --path_to_config <PATH>
python -m detector.run --command 'test' --path_to_config <PATH>
Demo
python demo.py -p <PATH_TO_CONFIG> --landmarks
Demo Full Frame Classifiers
python demo_ff.py -p <PATH_TO_CONFIG> --landmarks
Annotations
The annotations consist of bounding boxes of hands in COCO format [top left X position, top left Y position, width, height]
with gesture labels. Also, annotations have 21 landmarks
in format [x,y]
relative image coordinates, markups of leading hands
(left
or right
for gesture hand) and leading_conf
as confidence for leading_hand
annotation. We provide user_id
field that will allow you to split the train / val dataset yourself.
"0534147c-4548-4ab4-9a8c-f297b43e8ffb": {
"bboxes": [
[0.38038597, 0.74085361, 0.08349486, 0.09142549],
[0.67322755, 0.37933984, 0.06350809, 0.09187757]
],
"landmarks"[
[
[
[0.39917091, 0.74502739],
[0.42500172, 0.74984396],
...
],
[0.70590734, 0.46012364],
[0.69208878, 0.45407018],
...
],
],
"labels": [
"no_gesture",
"one"
],
"leading_hand": "left",
"leading_conf": 1.0,
"user_id": "bb138d5db200f29385f..."
}
- Key - image name without extension
- Bboxes - list of normalized bboxes [top left X pos, top left Y pos, width, height]
- Labels - list of class labels, e.g. like, stop, no_gesture
- Landmarks - list of normalized hand landmarks [x, y]
- Leading hand - right or left for the hand showing the gesture
- Leading conf - confidence for the leading_hand annotation
- User ID - subject id (useful for splitting the data into train / val subsets)
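As a minimal sketch of reading this format (the annotation path and image size below are assumptions; full-dataset images are FullHD, while resized versions need their actual sizes), normalized bounding boxes can be converted to pixel coordinates like this:

import json

def load_boxes(ann_path: str, img_w: int = 1920, img_h: int = 1080):
    """Read one annotation JSON and convert normalized bboxes to pixel coordinates."""
    with open(ann_path) as f:
        annotations = json.load(f)
    boxes = {}
    for image_id, ann in annotations.items():
        pixel_boxes = []
        for (x, y, w, h), label in zip(ann["bboxes"], ann["labels"]):
            # COCO-style [top-left x, top-left y, width, height], normalized to [0, 1]
            pixel_boxes.append((label, [x * img_w, y * img_h, w * img_w, h * img_h]))
        boxes[image_id] = pixel_boxes
    return boxes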
Bounding boxes
Object | Train + Val | Test | Total |
---|---|---|---|
gesture | ~ 28 300 per class | ~ 2 400 per class | 552 992 |
no gesture | 112 740 | 10 849 | 123 589 |
total boxes | 622 063 | 54 518 | 676 581 |
Landmarks
We annotated 21 hand keypoints using the MediaPipe open source framework. Because the markup is automatic, some landmark lists may be empty.
Object | Train + Val | Test | Total |
---|---|---|---|
leading hand | 503 872 | 43 167 | 547 039 |
not leading hand | 98 766 | 9 243 | 108 009 |
total landmarks | 602 638 | 52 410 | 655 048 |
Converters
Yolo
We provide a script to convert annotations to YOLO format. To convert annotations, run the following command:
python -m converters.hagrid_to_yolo --path_to_config <PATH>
After conversion, you need to change the original img2label_paths definition to:
def img2label_paths(img_paths):
    img_paths = list(img_paths)
    # Define label paths as a function of image paths
    if "subsample" in img_paths[0]:
        return [x.replace("subsample", "subsample_labels").replace(".jpg", ".txt") for x in img_paths]
    elif "train_val" in img_paths[0]:
        return [x.replace("train_val", "train_val_labels").replace(".jpg", ".txt") for x in img_paths]
    elif "test" in img_paths[0]:
        return [x.replace("test", "test_labels").replace(".jpg", ".txt") for x in img_paths]
Coco
We also provide a script to convert annotations to COCO format. To convert annotations, run the following command:
python -m converters.hagrid_to_coco --path_to_config <PATH>
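The resulting annotations can then be loaded with standard COCO tooling. A minimal sketch, assuming the converter writes an ordinary COCO JSON file to a path of your choosing:

from pycocotools.coco import COCO

# Hypothetical output path; the real destination is set in the converter config.
coco = COCO("hagrid_coco/train.json")
image_ids = coco.getImgIds()
annotations = coco.loadAnns(coco.getAnnIds(imgIds=image_ids[:1]))
print(annotations)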
License
This work is licensed under a variant of Creative Commons Attribution-ShareAlike 4.0 International License.
Please see the specific license.
Authors and Credits
Links
Citation
You can cite the paper using the following BibTeX entry:
@article{hagrid,
title={HaGRID - HAnd Gesture Recognition Image Dataset},
author={Kapitanov, Alexander and Makhlyarchuk, Andrey and Kvanchiani, Karina},
journal={arXiv preprint arXiv:2206.08219},
year={2022}
}