It is a pleasure to announce the invited talk by Snehal Bhayani, who is currently pursuing his doctoral studies at the Center for Machine Vision and Signal Analysis at the University of Oulu. Mr. Bhayani recently published his work on minimal solvers at CVPR 2020 and will talk about it during the session. Please find the details below:
The talk is relevant to computer vision algorithms that need to solve camera geometry problems, usually as part of RANSAC-like loops. Applications include reconstruction, augmented reality, camera calibration, and structure from motion. The idea is to model geometry problems using the minimum amount of information. Many computer algebra systems can be used to solve such problems accurately and efficiently. Solving such problems is usually divided into two stages, offline and online. The offline stage involves the more time-consuming operations, which are executed once per type of problem; for example, relative pose (stereo) for calibrated cameras can be solved from 5 point correspondences. We compare our approach against the state of the art in terms of speed and accuracy. The underlying tools are inspired by algebraic geometry.
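To illustrate what "a minimal solver inside a RANSAC-like loop" means, here is a minimal sketch on a toy problem. It is not one of the solvers from the talk: instead of camera geometry it fits a 2D line, whose minimal sample is two points. The function names `solve_line_minimal` and `ransac_line`, the threshold, and the toy data are all assumptions made for this illustration.

```python
import math
import random

def solve_line_minimal(p, q):
    """Toy minimal solver: line (a, b, c) with a*x + b*y + c = 0 from exactly two points."""
    (x1, y1), (x2, y2) = p, q
    a, b = y2 - y1, x1 - x2
    norm = math.hypot(a, b)
    if norm == 0.0:
        return None  # degenerate sample: both points coincide
    return a / norm, b / norm, -(a * x1 + b * y1) / norm

def ransac_line(points, iterations=200, threshold=0.1, seed=0):
    """Hypothesize-and-verify loop: sample a minimal set, solve, count inliers."""
    rng = random.Random(seed)
    best_model, best_inliers = None, []
    for _ in range(iterations):
        model = solve_line_minimal(*rng.sample(points, 2))
        if model is None:
            continue
        a, b, c = model
        # Because (a, b) is normalized, |a*x + b*y + c| is the point-to-line distance.
        inliers = [pt for pt in points if abs(a * pt[0] + b * pt[1] + c) < threshold]
        if len(inliers) > len(best_inliers):
            best_model, best_inliers = model, inliers
    return best_model, best_inliers

# Toy data (hypothetical): 30 points on y = 2x + 1 plus 10 gross outliers.
data_rng = random.Random(1)
points = [(x / 10.0, 2.0 * x / 10.0 + 1.0) for x in range(30)]
points += [(data_rng.uniform(0, 3), data_rng.uniform(-5, 12)) for _ in range(10)]
model, inliers = ransac_line(points)
```

The solver is called once per iteration, which is why its speed matters: in the camera geometry setting, the role of `solve_line_minimal` would be played by the online stage of a solver such as the 5-point relative pose solver.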
Written by Prathmesh Madhu
Inspired by the pioneering work of Max Imdahl, our work focuses on generating image composition canvas (ICC) diagrams based on two central themes: (a) detection of the action regions and action lines of an artwork and (b) pose-based segmentation of foreground and background. To validate our approach qualitatively and quantitatively, we conducted a user study involving experts and non-experts. The outcome of the study correlates highly with our approach and also demonstrates its domain-agnostic capability.
- Proof of concept for visual attention and for similarity in how artworks are interpreted by humans and machines.
- Can support art historians in their sophisticated art analysis, saving them a lot of time.
- A step towards understanding scenes without deep learning. This means that no annotated data is used and that interpretable features can further be used for image retrieval.
- Ways to exploit existing pre-trained models and methods in computer vision for better interpretation of scenes
- Currently, our approach works only for artworks that contain protagonists (persons).
- No benchmark datasets are available and no evaluation metric exists, so quantitative evaluation is very difficult.
- Our approach does not use a state-of-the-art gaze detection method, hence the gazes are not very precise.
- No working applications shown so far
Overview of the Methodology
Authors: Prathmesh Madhu*, Tilman Marquart*, Ronak Kosti, Peter Bell, Andreas Maier, Vincent Christlein (* represents equal contribution)
Preprint: ArXiv Link (https://arxiv.org/abs/2009.03807)
ACM Published Link : Accepted, VISART, 2020 (Link coming soon)
a) Action lines: The global action line (AL) is the line that passes through the main activity in the scene. This line is normally also aligned with the central protagonists. Local action lines, or pose lines (PL), are the lines that represent the poses of the protagonists.
b) Action regions: Action regions (AR) are the main regions of interest. More often than not, an action region is where the gazes of all the protagonists theoretically meet.
Our approach uses a pre-trained OpenPose network, image processing techniques, and a modified k-means clustering method. The pipeline of the proposed algorithm is shown in the figure above.
Our method consists of two main branches: (1) a detector for action lines and action regions and (2) the foreground/background separator.
We use the estimated poses to detect pose triangles. We then propose a simple technique (gaze cones) to obtain gaze direction estimates without any training or fine-tuning. Combining this information, we draw a final ICC that estimates the image composition of the scene under study.
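The text does not spell out how the gaze cones are constructed or combined, so the following is only a rough sketch under stated assumptions: each protagonist contributes a 2D cone with a hypothetical apex (head position), direction, and half-angle, and the action region is taken as the centroid of the grid cells covered by the most cones. The names `in_gaze_cone` and `estimate_action_region`, the 15 degree half-angle, and the grid step are illustrative choices, not the authors' method.

```python
import math

def in_gaze_cone(point, apex, direction, half_angle_deg):
    """True if `point` lies inside the 2D cone opening from `apex` along `direction`."""
    vx, vy = point[0] - apex[0], point[1] - apex[1]
    dist = math.hypot(vx, vy)
    dnorm = math.hypot(direction[0], direction[1])
    if dist == 0.0 or dnorm == 0.0:
        return False
    cos_angle = (vx * direction[0] + vy * direction[1]) / (dist * dnorm)
    return cos_angle >= math.cos(math.radians(half_angle_deg))

def estimate_action_region(gazes, width, height, half_angle_deg=15.0, step=5):
    """Return (centroid, votes) of the grid cells covered by the most gaze cones."""
    best_votes, best_cells = 0, []
    for y in range(0, height, step):
        for x in range(0, width, step):
            votes = sum(in_gaze_cone((x, y), apex, d, half_angle_deg) for apex, d in gazes)
            if votes > best_votes:
                best_votes, best_cells = votes, [(x, y)]
            elif votes == best_votes and votes > 0:
                best_cells.append((x, y))
    if not best_cells:
        return None, 0
    cx = sum(c[0] for c in best_cells) / len(best_cells)
    cy = sum(c[1] for c in best_cells) / len(best_cells)
    return (cx, cy), best_votes

# Hypothetical scene: three protagonists all looking towards the point (100, 100).
gazes = [((20, 100), (1, 0)), ((180, 100), (-1, 0)), ((100, 20), (0, 1))]
region, votes = estimate_action_region(gazes, width=200, height=200)
```

In this toy scene all three cones overlap around the image centre, so the estimated region lands close to the point the protagonists look at. In practice the apex and direction would have to be derived from the estimated pose keypoints.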
- Quantitatively, we propose the Hausdorff distance (HD) as an evaluation metric between lines and the Euclidean distance as a metric between action regions.
- The HD between all annotators and the ICC (ALL/ICC) is quite low (38 %) compared to the worst HD, which shows that the ALs of our ICC correlate very well with all annotators.
- We observed that when the annotators' agreement on the position of the AR was higher (lower), our method predicted the AR closer to (farther from) the labeled ones.
- Qualitative comparison with User Study (Experts + Non-Experts)
- Cross Domain Adaptability
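The two quantitative metrics named in the results above can be sketched as follows. This is a minimal illustration, assuming that each line is compared as a set of points sampled along a segment; the helper names and the sampling density are hypothetical, and the normalization against the worst HD mentioned above is not reproduced here.

```python
import math

def sample_segment(start, end, n=50):
    """Sample n evenly spaced points on the segment from start to end."""
    return [
        (start[0] + (end[0] - start[0]) * t / (n - 1),
         start[1] + (end[1] - start[1]) * t / (n - 1))
        for t in range(n)
    ]

def directed_hausdorff(a, b):
    """Largest distance from a point in a to its nearest point in b."""
    return max(min(math.hypot(p[0] - q[0], p[1] - q[1]) for q in b) for p in a)

def hausdorff(a, b):
    """Symmetric Hausdorff distance between two point sets."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))

def region_distance(center_a, center_b):
    """Euclidean distance between two action region centers."""
    return math.hypot(center_a[0] - center_b[0], center_a[1] - center_b[1])

# Hypothetical example: a predicted action line and an annotated one, 10 pixels apart.
predicted_line = sample_segment((0, 0), (100, 0))
annotated_line = sample_segment((0, 10), (100, 10))
line_hd = hausdorff(predicted_line, annotated_line)
ar_dist = region_distance((50, 50), (53, 54))
```

For the two parallel segments the Hausdorff distance equals their offset, and the region distance is the plain distance between the two centers.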
Imdahl, M.: Giotto, Arenafresken: Ikonographie-Ikonologie-Ikonik. Wilhelm Fink (1975)
Cao, Z., Hidalgo, G., Simon, T., Wei, S.E., Sheikh, Y.: OpenPose: Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields. arXiv:1812.08008 [cs] (May 2019)
Lin, T.Y., Maire, M., Belongie, S., Bourdev, L., Girshick, R., Hays, J., Perona, P., Ramanan, D., Zitnick, C.L., Dollár, P.: Microsoft COCO: Common Objects in Context. arXiv:1405.0312 [cs] (Feb 2015)
Written by Prathmesh Madhu
This work exploits style transfer in combination with transfer learning to recognize characters in art historical images. Our approach focuses on recognizing the two central characters of the "Annunciation of the Lord" scene, Mary and Gabriel, across varied artworks from different artists, times, and styles. This differs from existing methods in art history, where models are trained on artworks relating to either a specific style or a specific artist.
- Understanding the semantics to recognize characters in art historical images irrespective of style and time.
- Proof of concept for enhancing recognition models using style transfer; information from styled images is beneficial for improving the performance of deep CNN models.
- Validating the claim that our method captures more contextual and semantic information by visualizing the class activation maps (CAMs), which also helps to better understand model performance and guide further enhancements.
- The performance of traditional machine learning models (SVMs, decision trees) decreases when they are trained on the whole body of the characters as opposed to only the faces.
- It’s not an end-to-end pipeline.
- The current algorithm is only trained for 2 characters – Mary and Gabriel. We are already working on increasing the number of characters.
- Since there is no existing method in art history to compare against, there is no comparison with any state-of-the-art method.
Overview of the Methodology
Authors: Prathmesh Madhu, Ronak Kosti, Lara Mührenberg, Peter Bell, Andreas Maier, Vincent Christlein
ACM Published Link : https://dl.acm.org/doi/10.1145/3347317.3357242
We train a VGGFace model on the Mary and Gabriel body crops and call it VGGFace-A (baseline). We then train another VGGFace model on the styled dataset and call it VGGFace-B. Finally, we fine-tune VGGFace-B on the Mary and Gabriel body crops and call the result VGGFace-C, which is the style-transfer-learned model.
Generating Styled Dataset
Content Images : We chose IMDB-Wiki and Adience because their images are not limited to faces but also contain the upper body. Combining both, we have around 20000 images for each class, male and female, which we call the content images.
Style Images : Our annunciation dataset (from which Mary and Gabriel body crops are generated) has 2787 images which we call the style images.
Using a style transfer model based on adaptive instance normalization, introduced by Huang and Belongie, we transferred the artistic style of the style images to the content images.
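For readers unfamiliar with adaptive instance normalization (AdaIN), its core operation re-scales each channel of the content features so that its mean and standard deviation match those of the style features: AdaIN(x, y) = σ(y) · (x − μ(x)) / σ(x) + μ(y). The sketch below shows only this normalization step on toy data, using plain Python lists as stand-ins for feature maps; the encoder and decoder networks of the full style transfer model are omitted, and the function name and the `eps` value are assumptions for illustration.

```python
import math
import statistics

def adain(content, style, eps=1e-5):
    """Channel-wise AdaIN; each feature map is a list of channels (flat lists of floats)."""
    out = []
    for c_chan, s_chan in zip(content, style):
        c_mean = statistics.fmean(c_chan)
        s_mean = statistics.fmean(s_chan)
        # eps keeps the division stable when a channel is (almost) constant.
        c_std = math.sqrt(statistics.pvariance(c_chan) + eps)
        s_std = math.sqrt(statistics.pvariance(s_chan) + eps)
        out.append([s_std * (v - c_mean) / c_std + s_mean for v in c_chan])
    return out

# Toy feature maps with two channels each (values are made up).
content_features = [[1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]]
style_features = [[0.0, 0.0, 2.0, 2.0], [5.0, 5.0, 5.0, 9.0]]
stylized = adain(content_features, style_features)
```

After the operation, each output channel keeps the spatial pattern of the content channel but carries the per-channel statistics of the style channel.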
- Quantitatively, we achieved an average 7% increase in accuracy over the baseline and 8% over traditional approaches.
- Qualitatively, the CAMs show that our model learns the semantics and context very well from the style-informed models.
Row 1 shows the original images. Row 2 shows the CAMs for VGGFace-A, and Row 3 shows the CAMs for the VGGFace-C (style-transfer-learned) model. One can observe in column (c) how the context of the dress is adapted, while in column (d) the model focuses more on the facial region.
Eran Eidinger, Roee Enbar, and Tal Hassner. 2014. Age and gender estimation of unfiltered faces. IEEE Transactions on Information Forensics and Security 9, 12 (2014), 2170–2179.
Rasmus Rothe, Radu Timofte, and Luc Van Gool. 2016. Deep expectation of real and apparent age from a single image without facial landmarks. International Journal of Computer Vision (IJCV) (July 2016).
Xun Huang and Serge Belongie. 2017. Arbitrary style transfer in real-time with adaptive instance normalization. In Proceedings of the IEEE International Conference on Computer Vision. 1501–1510.