Rami Mahfoud “Continual learning for Object Detection in Art Historical Images” (MT Intro)

Ahmed Gomaa “Enhancing the robustness and efficiency of multimodal emotion estimation models” (MT Intro)

[Invited Talk] A Method for Efficient Minimal Solvers – Snehal Bhayani (CVPR 2020)

It is a pleasure to announce the invited talk by Snehal Bhayani, who is currently pursuing his doctoral studies at the Center for Machine Vision and Signal Analysis at the University of Oulu. Mr. Bhayani recently published his work on minimal solvers at CVPR 2020 and will talk about it during the session. Please find the details below:

Title: A Sparse Resultant Based Method for Efficient Minimal Solvers
Time: 1015 hrs, 7-September-2020
Room: CV Colloq
Paper: [CVF] [Arxiv]

The talk is relevant to computer vision algorithms that need to solve camera geometry problems, usually as part of RANSAC-like loops. Applications include reconstruction, augmented reality, camera calibration, and structure from motion. The idea is to model geometry problems using the minimum amount of information. Many computer algebra systems can be used to solve such problems accurately and efficiently. Solving such problems is usually divided into two stages, offline and online. The offline stage involves the more time-consuming operations, which are executed once per type of problem (e.g. relative pose (stereo) for calibrated cameras, which can be solved from five point correspondences); the online stage then solves each concrete problem instance. Our approach is compared against the state of the art in terms of speed and accuracy. The underlying tools are inspired by algebraic geometry.
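To make the role of a minimal solver inside a RANSAC-like loop more concrete, here is a small sketch in Python. It is not the sparse resultant method from the talk: as a stand-in, it uses the simplest possible minimal problem (a 2D line from exactly two points) instead of the five-point relative pose problem. The function names, the threshold, and the iteration count are assumptions chosen for illustration.

```python
import random


def solve_line_from_two_points(p, q):
    """Toy minimal solver: line a*x + b*y + c = 0 through exactly two points.

    Stand-in for a real minimal solver such as five-point relative pose.
    Returns a normalized (a, b, c) or None for a degenerate sample.
    """
    (x1, y1), (x2, y2) = p, q
    a, b = y2 - y1, x1 - x2
    norm = (a * a + b * b) ** 0.5
    if norm == 0.0:
        return None  # identical points do not define a line
    c = -(a * x1 + b * y1)
    return a / norm, b / norm, c / norm


def ransac_line(points, threshold=0.1, iterations=200, seed=0):
    """RANSAC loop: repeatedly solve from a minimal sample, keep the best model."""
    rng = random.Random(seed)
    best_model, best_inliers = None, []
    for _ in range(iterations):
        sample = rng.sample(points, 2)  # minimal sample size for a line
        model = solve_line_from_two_points(*sample)
        if model is None:
            continue
        a, b, c = model
        # Point-to-line distance, valid because (a, b) has unit length.
        inliers = [pt for pt in points
                   if abs(a * pt[0] + b * pt[1] + c) <= threshold]
        if len(inliers) > len(best_inliers):
            best_model, best_inliers = model, inliers
    return best_model, best_inliers
```

Because the solver is called once per iteration, its speed directly bounds the speed of the whole loop, which is why efficient minimal solvers matter in this setting.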

Understanding Compositional Structures in Art Historical Images using Pose and Gaze Priors

Written by Prathmesh Madhu


Inspired by the pioneering work of Max Imdahl [1], our work focuses on generating image composition canvas (ICC) diagrams based on two central themes: (a) detection of the action regions and action lines of an artwork, and (b) pose-based segmentation of foreground and background. To validate our approach qualitatively and quantitatively, we conduct a user study involving experts and non-experts. The outcome of the study correlates highly with our approach and also demonstrates its domain-agnostic capability.


  • Proof of concept for visual attention and for the similarity between how humans and machines interpret artworks.
  • Can support art historians in sophisticated art analysis, saving them a lot of time.
  • A step towards understanding scenes without deep learning, that is, without annotated data, and towards using interpretable features for image retrieval.
  • Ways to exploit existing pre-trained models and methods in computer vision for better interpretation of scenes.


  • Currently, our approach works only for artworks that contain protagonists (persons).
  • No benchmark datasets are available and no evaluation metric exists, so quantitative evaluation is very difficult.
  • Our approach does not use a state-of-the-art gaze detection method, hence the estimated gazes are not very precise.
  • No working applications have been shown so far.

Overview of the Methodology

Authors: Prathmesh Madhu*, Tilman Marquart*, Ronak Kosti, Peter Bell, Andreas Maier, Vincent Christlein (* represents equal contribution)
Preprint: ArXiv Link
ACM Published Link : Accepted, VISART, 2020 (Link coming soon)

a) Action lines: The global action line (AL) is the line that passes through the main activity in the scene; it is normally also aligned with the central protagonists. Local action lines, or pose lines (PL), are the lines that represent the poses of the protagonists.
b) Action regions: The action region(s) (AR) are the main region(s) of interest. More often than not, this is the region where the gazes of all the protagonists theoretically meet.

Our approach uses a pre-trained OpenPose [2] network, image processing techniques, and a modified k-means clustering method. The pipeline of the proposed algorithm is shown in the figure above.

Our method consists of two main branches: (1) a detector for action lines and action regions and (2) the foreground/background separator.

We use the estimated poses to detect pose triangles. We then propose a simple technique (gaze cones) to obtain gaze direction estimates without any training or fine-tuning. Combining this information, we draw a final ICC that estimates the image composition of the scene under study.
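The action region was described above as the region where the gazes of the protagonists theoretically meet. As a rough illustration of that idea only, the sketch below finds the 2D point closest (in the least-squares sense) to a set of gaze lines. This is not the gaze-cone technique of the paper, whose details are not given here; the function name, the input format (head positions and gaze directions in image coordinates), and the use of infinite lines instead of cones are all assumptions.

```python
import math


def estimate_gaze_meeting_point(origins, directions):
    """Least-squares point closest to all gaze lines (hypothetical helper).

    origins:    list of (x, y) head positions.
    directions: list of (dx, dy) gaze directions, one per origin.
    Each gaze is treated as an infinite line, so the sign of the
    direction is ignored. Returns (x, y), or None if all lines are parallel.
    """
    a11 = a12 = a22 = b1 = b2 = 0.0
    for (px, py), (dx, dy) in zip(origins, directions):
        length = math.hypot(dx, dy)
        if length == 0.0:
            continue  # skip gazes without a usable direction
        dx, dy = dx / length, dy / length
        # Projector onto the line normal: M = I - d d^T
        m11, m12, m22 = 1.0 - dx * dx, -dx * dy, 1.0 - dy * dy
        a11 += m11
        a12 += m12
        a22 += m22
        b1 += m11 * px + m12 * py
        b2 += m12 * px + m22 * py
    det = a11 * a22 - a12 * a12
    if abs(det) < 1e-12:
        return None
    # Solve the 2x2 normal equations with Cramer's rule.
    x = (a22 * b1 - a12 * b2) / det
    y = (a11 * b2 - a12 * b1) / det
    return x, y
```

When the gaze lines do not intersect exactly, the returned point is a compromise between them, which matches the "theoretical" meeting point only approximately.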


  1. Quantitatively, we propose the Hausdorff distance (HD) as an evaluation metric between lines and the Euclidean distance between action regions.
    1. The HD between all annotators and the ICC (ALL/ICC) is low relative to the worst-case HD (38 %), showing that the ALs of our ICC correlate very well with those of all annotators.
    2. We observed that when the annotators' agreement on the position of the AR was higher (lower), our method predicted the AR closer to (farther from) the labeled ones.
  • Qualitative comparison with User Study (Experts + Non-Experts)
User study analysis. The 1st row shows an example of The Kiss of Judas, the 2nd of the Annunciation, and the 3rd of COCO [3]. The 1st column shows the original images; the 2nd shows the action region (AR) predicted by the ICC (cyan), those of all evaluators (red), and the centroid of all annotators; the 3rd and 4th show the AL and PL by the ICC and by all annotators, respectively; and the 5th shows the eye-fixation regions.
  • Cross Domain Adaptability 
Cross-domain analysis. The 1st row depicts the Baptism, the 2nd the Annunciation, and the 3rd images from COCO [3]. The 1st and 3rd columns are original images; the 2nd and 4th are their respective image composition canvases.
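For reference, the two quantitative measures named in the results above (Hausdorff distance between lines, Euclidean distance between action regions) can be sketched as follows. How the lines are discretized in the actual evaluation is not stated here, so sampling each line segment into a fixed number of points is an assumption, as are the function names.

```python
import math


def sample_segment(p, q, n=50):
    """Sample n evenly spaced points on the segment from p to q (n >= 2)."""
    return [(p[0] + (q[0] - p[0]) * t / (n - 1),
             p[1] + (q[1] - p[1]) * t / (n - 1)) for t in range(n)]


def directed_hausdorff(a, b):
    """Largest distance from a point in a to its nearest point in b."""
    return max(min(math.dist(p, q) for q in b) for p in a)


def hausdorff(a, b):
    """Symmetric Hausdorff distance between two point sets."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))


def action_region_error(predicted_center, annotated_center):
    """Euclidean distance between two action region centers."""
    return math.dist(predicted_center, annotated_center)
```

A predicted action line and an annotated one would each be passed through `sample_segment` before calling `hausdorff`; a smaller value means the two lines lie closer together.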


[1] Imdahl, M.: Giotto, Arenafresken: Ikonographie-Ikonologie-Ikonik. Wilhelm Fink (1975)

[2] Cao, Z., Hidalgo, G., Simon, T., Wei, S.E., Sheikh, Y.: OpenPose: Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields. arXiv:1812.08008 [cs] (May 2019)

[3] Lin, T.Y., Maire, M., Belongie, S., Bourdev, L., Girshick, R., Hays, J., Perona, P., Ramanan, D., Zitnick, C.L., Dollár, P.: Microsoft COCO: Common Objects in Context. arXiv:1405.0312 [cs] (Feb 2015)


Styled Transfer Learning for Art Historical Images

Written by Prathmesh Madhu


This work exploits style transfer in combination with transfer learning to recognize characters in art historical images. Our approach focuses on recognizing the two central characters of the “Annunciation of the Lord” scene from art history, Mary and Gabriel, across varied artworks from different artists, times, and styles. This differs from existing methods in art history, where models are trained on artworks of either a specific style or a specific artist.


  • Understanding the semantics to recognize characters in art historical images irrespective of style and time.
  • Proof of concept for enhancing recognition models using style transfer; information from styled images is beneficial for improving the performance of deep CNN models.
  • Validating the claim that our method captures more contextual and semantic information by visualizing the class activation maps (CAMs), which also helps to better understand model performance and to guide further enhancements.
  • The performance of traditional machine learning models (SVMs, decision trees) decreases when they are trained on the whole body of the characters as opposed to only the faces.


  • It’s not an end-to-end pipeline.
  • The current algorithm is only trained for 2 characters – Mary and Gabriel. We are already working on increasing the number of characters.
  • Since no existing method for art history is available for comparison, we do not compare against any SOTA methods.

Overview of the Methodology

Authors: Prathmesh Madhu, Ronak Kosti, Lara Mührenberg, Peter Bell, Andreas Maier, Vincent Christlein
ACM Published Link :

We train a VGGFace model on the Mary and Gabriel body crops and call it VGGFace-A (baseline). We then train another VGGFace model on the styled dataset and call it VGGFace-B. Finally, we fine-tune VGGFace-B on the Mary and Gabriel body crops and call the result VGGFace-C, the styled transfer learned model.

Generating Styled Dataset

Content Images : We chose IMDB-Wiki [2] and Adience [1] because their images are not limited to faces but also contain the upper body. Combining both, we have around 20000 images for each class, male and female, which we call the content images.

Style Images : Our annunciation dataset (from which Mary and Gabriel body crops are generated) has 2787 images which we call the style images.

Using the style transfer model based on adaptive instance normalization (AdaIN) introduced by Huang and Belongie [3], we transferred the artistic style of the style images to the content images.
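The core operation of that style transfer model, adaptive instance normalization, re-scales each channel of the content features so that its mean and standard deviation match those of the style features: AdaIN(x, y) = σ(y) · (x − μ(x)) / σ(x) + μ(y). The sketch below shows only this operation on plain Python lists; in the actual model it is applied to feature maps of a pre-trained encoder and followed by a learned decoder, both of which are omitted. The function name, the data layout, and the `eps` value are assumptions.

```python
from statistics import fmean, pstdev


def adain(content, style, eps=1e-5):
    """Adaptive instance normalization on per-channel feature lists.

    content, style: lists of channels; each channel is a flat list of
    activations (the spatial dimensions flattened). The two inputs must
    have the same number of channels but may differ in spatial size.
    eps avoids division by zero for constant content channels.
    """
    out = []
    for c_chan, s_chan in zip(content, style):
        c_mean, c_std = fmean(c_chan), pstdev(c_chan) + eps
        s_mean, s_std = fmean(s_chan), pstdev(s_chan)
        # Normalize the content channel, then apply the style statistics.
        out.append([s_std * (v - c_mean) / c_std + s_mean for v in c_chan])
    return out
```

After this step each output channel carries the style image's channel statistics while keeping the spatial layout of the content image, which is what lets portrait photos take on the look of the Annunciation artworks.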


  1. Quantitatively, we achieved an average 7% rise in accuracy compared to the baseline and 8% compared to traditional approaches.
  2. Qualitatively, the CAMs show that our style-informed model learns the semantics and context well.

Row 1 shows the original images. Row 2 shows the CAMs for VGGFace-A and Row 3 the CAMs for the VGGFace-C (styled transfer learned) model. One can observe in column (c) how the context of the dress is adapted, and in column (d) how the model focuses more on the facial region.


[1] Eran Eidinger, Roee Enbar, and Tal Hassner. 2014. Age and gender estimation of unfiltered faces. IEEE Transactions on Information Forensics and Security 9, 12 (2014), 2170–2179.

[2] Rasmus Rothe, Radu Timofte, and Luc Van Gool. 2016. Deep expectation of real and apparent age from a single image without facial landmarks. International Journal of Computer Vision (IJCV) (July 2016).

[3] Xun Huang and Serge Belongie. 2017. Arbitrary style transfer in real-time with adaptive instance normalization. In Proceedings of the IEEE International Conference on Computer Vision. 1501–1510.