Multimodal fusion of pose and visual information for gesture recognition in historical artworks
Gestures in historical artworks can communicate underlying human experiences, offering a broad outlook on the sensory worlds of the past. To explore this domain, we use SensoryArt [1], a dataset of multisensory gestures in historical artworks that comes with person pose estimation keypoints and gesture labels. The goal of this thesis is to classify the gestures of the persons depicted in the paintings. We aim to investigate how additional information about body posture, such as annotated skeleton information, affects model performance.
Mandatory Goals:
- Train a model for multi-label gesture classification on the cropped images of the SensoryArt dataset, with ground-truth keypoint heatmaps fused into the input, and evaluate it on the validation split.
- Select and train a well-performing keypoint estimation model.
- Evaluate the end-to-end pipeline on the cropped images, first predicting the heatmaps and then classifying the gestures.
- Train another model for multi-person gesture classification at the image level, with ground-truth heatmaps fused into the uncropped images, and evaluate it on the validation split.
- Perform an inference test of this model on the original images with machine-generated heatmaps.
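One common way to fuse heatmaps with an image, sketched below under the assumption of early fusion, is to render each annotated keypoint as a 2D Gaussian heatmap and stack the heatmaps with the RGB channels, yielding a (3 + K)-channel input for the classifier. The function names, the 17-keypoint layout, and the Gaussian sigma are illustrative choices, not part of the SensoryArt specification:

```python
import numpy as np

def keypoint_heatmap(x, y, height, width, sigma=4.0):
    """Render a single keypoint as a 2D Gaussian heatmap."""
    ys, xs = np.mgrid[0:height, 0:width]
    return np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))

def fuse_image_and_heatmaps(image, keypoints, sigma=4.0):
    """Stack the RGB image with one heatmap channel per keypoint,
    producing a (H, W, 3 + K) array as classifier input."""
    h, w, _ = image.shape
    maps = [keypoint_heatmap(x, y, h, w, sigma) for x, y in keypoints]
    return np.concatenate([image, np.stack(maps, axis=-1)], axis=-1)

# Example: a 64x64 RGB crop with 17 COCO-style keypoints
image = np.random.rand(64, 64, 3)
keypoints = np.random.uniform(0, 64, size=(17, 2))
fused = fuse_image_and_heatmaps(image, keypoints)
print(fused.shape)  # (64, 64, 20)
```

The same fusion applies to the end-to-end pipeline: the ground-truth keypoints are simply replaced by the predictions of the keypoint estimation model before the channels are stacked.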
Optional Goals:
- Test incorporating additional body-position information not as heatmaps but as skeleton keypoint coordinates or joint angles.
- Conduct additional ablations, such as cropping the depicted persons to square patches.
- Integrate the multi-label approach into the detection pipeline.
- Test human pose estimation on artworks using the additionally provided gesture labels.
[1] Zinnen, M., Christlein, V., Maier, A., & Hussian, A. (2024). SensoryArt (1.0.0) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.10889613