Multimodal Gesture Classification in Artwork Images

This thesis addresses the challenge of gesture classification in artwork images, specifically on the SniffyArt dataset [1]. Traditional classification methods fall short due to the domain shift, the limited dataset size, class imbalance, and the difficulty of discriminating between different smell gestures. The thesis tackles this challenge by exploring multimodal learning techniques, leveraging bounding box and keypoint information and their fusion to give the classification network a richer contextual understanding.

Objectives:

Literature Review: Conduct an in-depth review of existing multimodal learning techniques, with a focus on methodologies that use both bounding box and keypoint information, such as ED-pose [2], UniPose [3], and PRTR [4], among others.

Model Design: Add a specialized classifier that takes the whole-image context together with the person box and keypoint features obtained from one of the methods in the literature, such as ED-pose, and performs gesture classification.

Model Evaluation: Evaluate the performance of the proposed model on all modalities, i.e., person detection, pose estimation, and gesture classification, as well as on their combination.

Baseline Results: Create baseline results for box detection, pose estimation, and gesture classification using 1) separate standard models for each of these modalities, and 2) the selected method from the literature review trained directly on gesture boxes, i.e., without a specialized classifier.

Aside from evaluating the subtasks separately, evaluate the full pipeline, i.e., classification performance on the whole image when neither bounding box nor keypoint information is available.

Optional Tasks: Incorporate text prompts as an additional modality, as in UniPose.
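
Illustrative sketch: The classifier described under Model Design could be sketched roughly as follows. This is only a framework-free illustration of the data flow under several assumptions, not the proposed architecture: fusion is done by simple concatenation, normalized box and keypoint coordinates stand in for the learned box and keypoint features a pose model would provide, and a single linear layer with softmax acts as the classifier. All names, dimensions, the (x, y, w, h) box format, and the class and keypoint counts are hypothetical placeholders.

```python
import math
import random


def linear(x, weights, bias):
    """Affine map producing one logit per row of `weights`."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def normalize_box(box, img_w, img_h):
    """Scale an (x, y, w, h) box to [0, 1] relative to the image size."""
    x, y, w, h = box
    return [x / img_w, y / img_h, w / img_w, h / img_h]


def normalize_keypoints(keypoints, box):
    """Express (x, y) keypoints relative to the person box."""
    bx, by, bw, bh = box
    out = []
    for kx, ky in keypoints:
        out += [(kx - bx) / bw, (ky - by) / bh]
    return out


def classify_gesture(image_feat, box, keypoints, img_size, weights, bias):
    """Concatenate image, box, and keypoint features, then classify."""
    fused = (
        list(image_feat)
        + normalize_box(box, *img_size)
        + normalize_keypoints(keypoints, box)
    )
    return softmax(linear(fused, weights, bias))


# Hypothetical sizes, chosen only for the demonstration.
NUM_CLASSES = 6      # placeholder number of gesture classes
NUM_KEYPOINTS = 17   # placeholder skeleton size
IMG_FEAT_DIM = 8     # placeholder whole-image feature size

random.seed(0)
image_feat = [random.random() for _ in range(IMG_FEAT_DIM)]
box = (40.0, 30.0, 100.0, 200.0)
keypoints = [
    (box[0] + random.random() * box[2], box[1] + random.random() * box[3])
    for _ in range(NUM_KEYPOINTS)
]

# Untrained random weights; a real model would learn these.
in_dim = IMG_FEAT_DIM + 4 + 2 * NUM_KEYPOINTS
weights = [[random.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(NUM_CLASSES)]
bias = [0.0] * NUM_CLASSES

probs = classify_gesture(image_feat, box, keypoints, (640, 480), weights, bias)
print(len(probs), round(sum(probs), 6))
```

Concatenation is only the simplest fusion choice. Dropping one of the three inputs from `fused` gives a quick way to compare single-modality variants against their combination.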

[1] Zinnen, M., Hussian, A., Tran, H., Madhu, P., Maier, A., & Christlein, V. (2023, November). SniffyArt: The Dataset of Smelling Persons. In Proceedings of the 5th Workshop on analySis, Understanding and proMotion of heritAge Contents (pp. 49-58).

[2] Yang, J., Zeng, A., Liu, S., Li, F., Zhang, R., & Zhang, L. (2023). Explicit box detection unifies end-to-end multi-person pose estimation. arXiv preprint arXiv:2302.01593.

[3] Yang, J., Zeng, A., Zhang, R., & Zhang, L. (2023). UniPose: Detecting any keypoints. arXiv preprint arXiv:2310.08530.

[4] Li, K., Wang, S., Zhang, X., Xu, Y., Xu, W., & Tu, Z. (2021). Pose recognition with cascade transformers. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 1944-1953).