Deep learning method for emotion recognition through the fusion of body and context features

Type: MA thesis

Status: finished

Date: January 2, 2022 - July 4, 2022

Supervisors: Luis Carlos Rivera Monroy, M. Sc., Ronak Kosti, Andreas Maier, Dr. rer. nat. Patrick Krauß


Emotion recognition currently has a wide range of applications across many fields. For example, doctors can use it to take better care of patients, and teachers can use it to gauge students' concentration levels. Previous visual emotion recognition research has mainly focused on facial expressions, but this approach has not achieved good results in unconstrained scenarios. This thesis presents a network for emotion recognition based on posture, body features, and context information, with the aim of improving emotion recognition accuracy across diverse unconstrained scenarios.

The network is designed as a three-branch architecture with three feature streams: a body stream, a skeleton stream, and a context stream. For extracting human key points from images, three networks were explored: OpenPose [1], AlphaPose [2], and MediaPipe [3]. A graph convolutional network (GCN) extracts features from the key points, while body and context features are extracted using a Vision Transformer and a ResNet-50. The three streams are fused to predict a dimensional emotion representation: valence, arousal, and dominance. The model is trained on a public dataset (the EMOTIC dataset [4]) and validated on the CAER-S dataset [5]. Experimental results show that the proposed method effectively integrates the emotional information expressed by body and context and has good generalization ability and applicability. Taking body pose into account when recognizing people's emotions provides a new benchmark for pose-based visual emotion recognition.
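The fusion step described above can be sketched in a minimal way: each stream produces a feature vector, the vectors are concatenated, and a linear regression head maps the fused representation to the three continuous emotion dimensions (valence, arousal, dominance). The feature dimensions, the random weights, and the function name below are illustrative assumptions, not the thesis implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed feature dimensions for illustration; the thesis does not list them.
D_BODY, D_SKEL, D_CTX = 512, 256, 512

def fuse_and_predict(body_feat, skel_feat, ctx_feat, W, b):
    """Late fusion: concatenate the three stream features and apply a
    linear head that regresses valence, arousal, and dominance."""
    fused = np.concatenate([body_feat, skel_feat, ctx_feat])
    return W @ fused + b  # shape (3,): [valence, arousal, dominance]

# Toy weights standing in for a trained regression head.
W = rng.normal(scale=0.01, size=(3, D_BODY + D_SKEL + D_CTX))
b = np.zeros(3)

vad = fuse_and_predict(rng.normal(size=D_BODY),
                       rng.normal(size=D_SKEL),
                       rng.normal(size=D_CTX),
                       W, b)
print(vad.shape)
```

In a trained network, `W` and `b` would be learned jointly with the three backbones, and the fusion could equally be implemented with additional nonlinear layers; plain concatenation is shown only as the simplest form of the idea.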


[1] Cao, Zhe, et al. “Realtime multi-person 2d pose estimation using part affinity fields.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2017.

[2] Fang, Hao-Shu, et al. "RMPE: Regional multi-person pose estimation." Proceedings of the IEEE International Conference on Computer Vision. 2017.

[3] Lugaresi, Camillo, et al. “Mediapipe: A framework for building perception pipelines.” arXiv preprint arXiv:1906.08172 (2019).

[4] Kosti, Ronak, et al. "Context based emotion recognition using emotic dataset." IEEE Transactions on Pattern Analysis and Machine Intelligence 11 (2019): 2755-2766.

[5] Lee, Jiyoung, et al. "Context-aware emotion recognition networks." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019.