Lecture Notes in Deep Learning: Segmentation and Object Detection – Part 5

Instance Segmentation

These are the lecture notes for FAU’s YouTube Lecture “Deep Learning”. This is a full transcript of the lecture video and matching slides. We hope you enjoy this as much as the videos. Of course, this transcript was created largely automatically with deep learning techniques, and only minor manual modifications were performed. Try it yourself! If you spot mistakes, please let us know!

Instance segmentation can also be used for video editing. Image created using gifify. Source: YouTube

Welcome back to deep learning! Today, we want to talk about the last part of object detection and segmentation. We want to look into the concept of instance segmentation.

Semantic segmentation vs. instance segmentation. Image under CC BY 4.0 from the Deep Learning Lecture.

So, let’s have a look at our slides. You see this is already the last part, part five, and now we want to talk about instance segmentation. We do not just want to detect which pixels show cubes rather than cups. We really want to figure out which pixels belong to which cube. This is essentially a combination of object detection and semantic segmentation.
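To make the distinction concrete, here is a minimal Python sketch. The toy label maps and the helper `count_instances_per_class` are made up for illustration and are not part of the lecture: a semantic map stores one class id per pixel, while an instance map additionally gives every object its own id, which is what allows us to count objects of the same class.

```python
# Toy 4x6 scene: class 0 = background, class 1 = cube, class 2 = cup.
# Semantic segmentation: both cubes share the same label.
semantic_map = [
    [0, 1, 1, 0, 1, 1],
    [0, 1, 1, 0, 1, 1],
    [0, 0, 0, 0, 0, 0],
    [0, 2, 2, 0, 0, 0],
]

# Instance segmentation: every object gets its own id (0 = background).
instance_map = [
    [0, 1, 1, 0, 2, 2],
    [0, 1, 1, 0, 2, 2],
    [0, 0, 0, 0, 0, 0],
    [0, 3, 3, 0, 0, 0],
]


def count_instances_per_class(semantic, instance):
    """Count the distinct instance ids seen for each non-background class."""
    ids_per_class = {}
    for sem_row, inst_row in zip(semantic, instance):
        for cls, inst in zip(sem_row, inst_row):
            if cls != 0:
                ids_per_class.setdefault(cls, set()).add(inst)
    return {cls: len(ids) for cls, ids in ids_per_class.items()}


counts = count_instances_per_class(semantic_map, instance_map)
print(counts)  # {1: 2, 2: 1} -> two cubes, one cup
```

From the semantic map alone, we could only say that there are cube pixels, not how many cubes there are.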

Examples for instance segmentation approaches. Image under CC BY 4.0 from the Deep Learning Lecture.

Examples of potential applications are obtaining information about occlusion, counting the number of elements belonging to the same class, and detecting object boundaries, for example for gripping objects in robotics. This is very important, and there are examples in the literature such as simultaneous detection and segmentation, DeepMask, SharpMask, and Mask RCNN [10].

Mask RCNN is one of the most widespread instance segmentation approaches. Image under CC BY 4.0 from the Deep Learning Lecture.

Let’s look at [10] in a little more detail. We essentially go back to what we have already seen and combine object detection with segmentation. We use RCNN for object detection. It essentially solves the instance separation. Then, the segmentation refines the bounding boxes per instance.

Concept of Mask RCNN. Image under CC BY 4.0 from the Deep Learning Lecture.

The workflow is a two-stage procedure. First, you have the region proposal that proposes the object bounding boxes. Then, you have the classification with bounding box regression and the segmentation in parallel. So, you have a multi-task loss that essentially combines three terms: the pixel-classification loss of the segmentation, the box loss, and the class loss for producing the right class and bounding box.
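A minimal sketch of such a multi-task loss for a single region of interest is given below. The concrete choices are assumptions for illustration and are not stated in the lecture: cross-entropy for the class, smooth L1 for the box, a per-pixel binary cross-entropy for the mask, and an unweighted sum L = L_cls + L_box + L_mask. All function names and toy values are hypothetical.

```python
import math


def class_loss(class_probs, true_class):
    """Cross-entropy: negative log-probability assigned to the true class."""
    return -math.log(class_probs[true_class])


def box_loss(pred_box, true_box):
    """Smooth L1 distance, summed over the four box coordinates."""
    total = 0.0
    for p, t in zip(pred_box, true_box):
        diff = abs(p - t)
        total += 0.5 * diff * diff if diff < 1.0 else diff - 0.5
    return total


def mask_loss(pred_mask, true_mask, eps=1e-7):
    """Binary cross-entropy, averaged over all mask pixels."""
    total, n_pixels = 0.0, 0
    for pred_row, true_row in zip(pred_mask, true_mask):
        for p, t in zip(pred_row, true_row):
            p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
            total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
            n_pixels += 1
    return total / n_pixels


def multi_task_loss(class_probs, true_class, pred_box, true_box,
                    pred_mask, true_mask):
    """Unweighted sum of the three terms (assumed weighting)."""
    l_cls = class_loss(class_probs, true_class)
    l_box = box_loss(pred_box, true_box)
    l_mask = mask_loss(pred_mask, true_mask)
    return l_cls + l_box + l_mask, {"class": l_cls, "box": l_box, "mask": l_mask}


# Toy prediction for one region of interest (made-up numbers).
total, parts = multi_task_loss(
    class_probs=[0.1, 0.7, 0.2], true_class=1,
    pred_box=[0.1, 0.2, 0.9, 1.5], true_box=[0.0, 0.0, 1.0, 1.0],
    pred_mask=[[0.9, 0.8], [0.2, 0.1]], true_mask=[[1, 1], [0, 0]],
)
print(parts, total)
```

Because all three terms feed one scalar, a single backward pass would update the shared features for classification, box regression, and segmentation at once.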

Architecture of Mask RCNN. Image under CC BY 4.0 from the Deep Learning Lecture.

Let’s look in some more detail into the two-stage procedure. You have two different options here for two-stage networks. You can have a joint branch that works on the ROIs and then splits at a later stage into the mask segmentation and the class and bounding box prediction, or you can split early. Then, you run that in two separate networks. In both versions, you have this multi-task loss that combines the pixel-wise segmentation loss, the box loss, and the class loss.

Instance segmentation results. Image under CC BY 4.0 from the Deep Learning Lecture.

Let’s have a look at some examples. These are again results from Mask RCNN. To be honest, these are quite impressive results, and there are really difficult cases among them. You identify where the persons are, and you also see that the different persons, of course, are different instances. So, very impressive results!

Mask RCNN is also suited to support autonomous driving. Image created using gifify. Source: YouTube.

So, let’s summarize what we’ve seen so far. Segmentation is commonly solved by architectures that analyze the image and subsequently refine the coarse results. Fully convolutional networks preserve the spatial layout and enable arbitrary input sizes with pooling.
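The point about arbitrary input sizes can be illustrated with a small sketch. Everything here is an assumption for illustration: the 3x3 kernel, the toy images, and the helper names are hypothetical, and a real network would learn its filters. Because convolution and pooling only slide local windows, the same weights apply to inputs of any size, and the output keeps the spatial layout at a reduced resolution.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image without padding
    (cross-correlation, as is usual in deep learning)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def max_pool_2x2(feature_map):
    """Non-overlapping 2x2 max pooling (a trailing odd row/column is dropped)."""
    out_h, out_w = len(feature_map) // 2, len(feature_map[0]) // 2
    return [
        [
            max(feature_map[2 * i][2 * j], feature_map[2 * i][2 * j + 1],
                feature_map[2 * i + 1][2 * j], feature_map[2 * i + 1][2 * j + 1])
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def shape(grid):
    """Return (rows, columns) of a nested list."""
    return (len(grid), len(grid[0]))


# One fixed 3x3 filter, reused unchanged for both inputs.
kernel = [[0, 1, 0], [1, -4, 1], [0, 1, 0]]

small = [[(r * c) % 5 for c in range(6)] for r in range(6)]    # 6x6 input
large = [[(r + c) % 7 for c in range(8)] for r in range(10)]   # 10x8 input

small_out = max_pool_2x2(conv2d_valid(small, kernel))
large_out = max_pool_2x2(conv2d_valid(large, kernel))
print(shape(small_out), shape(large_out))  # (2, 2) (4, 3)
```

A fully connected layer, in contrast, would fix the number of inputs and therefore the image size.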

Summary of segmentation and object detection. Image under CC BY 4.0 from the Deep Learning Lecture.

We can use object detectors and implement them as a sequence of region proposals and classification. This essentially leads to the family of RCNN-type networks. Alternatively, you can go to single-shot detectors. We looked at YOLO, which is a very common and very fast technique, with variants such as YOLO9000. We looked into RetinaNet, which helps if you really have a scale dependency and want to detect on many different scales, for example in histological slice processing. So, object detection and segmentation are closely related, and combinations are common, as you have seen here for the purpose of instance segmentation.

More exciting things coming up in this deep learning lecture. Image under CC BY 4.0 from the Deep Learning Lecture.

Let’s look at what we still have to talk about in this lecture. Coming up very soon are methods to relieve the burden of labeling. So, we will talk about weak annotation. How can we generate labels? This then also leads to the concept of self-supervision, which is a very popular topic right now. It has been used very heavily in order to generate better networks. These methods are also able to reuse sparsely labeled or even completely unlabeled data. We will look into some of the more advanced methods. One idea that I want to show you later is the use of known operators. How can we integrate knowledge into networks? Which properties does this have? We will also demonstrate some ideas on how we could potentially make parts of networks reusable. So, there are exciting things still coming up.

Some comprehension questions. Image under CC BY 4.0 from the Deep Learning Lecture.

I have some comprehension questions for you like “What is the difference between semantic and instance segmentation?”, “What is the connection to object detection?”, “How can we construct a network which accepts arbitrary input sizes?”, “What is ROI pooling?”, “How can we perform backpropagation through an ROI pooling layer?”, “What are typical measures for the evaluation of segmentations?”, or, for example, I could ask you to explain a method for instance segmentation.

I have a couple of further readings in terms of links. So, there is this awesome website by Joseph Redmon, the creator of YOLO. I think this is a really nice library that is called darknet. You can also study Joseph Redmon’s CV; I have the link here. I think if you follow this kind of layout, this will definitely jumpstart your career. Please take your time and also look at the references below. We selected really very good state-of-the-art papers, and we can definitely recommend having a look at them. So, thank you very much for listening to this lecture, and I hope you liked our short excursion to the more applied fields like segmentation and object detection. I hope that this turns out to be useful for you, and I also hope that we will see each other again in one of our next videos. So, thank you very much and goodbye!

This article is released under the Creative Commons 4.0 Attribution License and can be reprinted and modified if referenced.


[1] Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. “Segnet: A deep convolutional encoder-decoder architecture for image segmentation”. In: arXiv preprint arXiv:1511.00561 (2015).
[2] Xiao Bian, Ser Nam Lim, and Ning Zhou. “Multiscale fully convolutional network with application to industrial inspection”. In: Applications of Computer Vision (WACV), 2016 IEEE Winter Conference on. IEEE. 2016, pp. 1–8.
[3] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, et al. “Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs”. In: CoRR abs/1412.7062 (2014). arXiv: 1412.7062.
[4] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, et al. “Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs”. In: arXiv preprint arXiv:1606.00915 (2016).
[5] S. Ren, K. He, R. Girshick, et al. “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence 39.6 (June 2017), pp. 1137–1149.
[6] R. Girshick. “Fast R-CNN”. In: 2015 IEEE International Conference on Computer Vision (ICCV). Dec. 2015, pp. 1440–1448.
[7] Tsung-Yi Lin, Priya Goyal, Ross Girshick, et al. “Focal loss for dense object detection”. In: arXiv preprint arXiv:1708.02002 (2017).
[8] Alberto Garcia-Garcia, Sergio Orts-Escolano, Sergiu Oprea, et al. “A Review on Deep Learning Techniques Applied to Semantic Segmentation”. In: arXiv preprint arXiv:1704.06857 (2017).
[9] Bharath Hariharan, Pablo Arbeláez, Ross Girshick, et al. “Simultaneous detection and segmentation”. In: European Conference on Computer Vision. Springer. 2014, pp. 297–312.
[10] Kaiming He, Georgia Gkioxari, Piotr Dollár, et al. “Mask R-CNN”. In: CoRR abs/1703.06870 (2017). arXiv: 1703.06870.
[11] N. Dalal and B. Triggs. “Histograms of oriented gradients for human detection”. In: 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Vol. 1. June 2005, pp. 886–893.
[12] Jonathan Huang, Vivek Rathod, Chen Sun, et al. “Speed/accuracy trade-offs for modern convolutional object detectors”. In: CoRR abs/1611.10012 (2016). arXiv: 1611.10012.
[13] Jonathan Long, Evan Shelhamer, and Trevor Darrell. “Fully convolutional networks for semantic segmentation”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015, pp. 3431–3440.
[14] Pauline Luc, Camille Couprie, Soumith Chintala, et al. “Semantic segmentation using adversarial networks”. In: arXiv preprint arXiv:1611.08408 (2016).
[15] Christian Szegedy, Scott E. Reed, Dumitru Erhan, et al. “Scalable, High-Quality Object Detection”. In: CoRR abs/1412.1441 (2014). arXiv: 1412.1441.
[16] Hyeonwoo Noh, Seunghoon Hong, and Bohyung Han. “Learning deconvolution network for semantic segmentation”. In: Proceedings of the IEEE International Conference on Computer Vision. 2015, pp. 1520–1528.
[17] Adam Paszke, Abhishek Chaurasia, Sangpil Kim, et al. “Enet: A deep neural network architecture for real-time semantic segmentation”. In: arXiv preprint arXiv:1606.02147 (2016).
[18] Pedro O Pinheiro, Ronan Collobert, and Piotr Dollár. “Learning to segment object candidates”. In: Advances in Neural Information Processing Systems. 2015, pp. 1990–1998.
[19] Pedro O Pinheiro, Tsung-Yi Lin, Ronan Collobert, et al. “Learning to refine object segments”. In: European Conference on Computer Vision. Springer. 2016, pp. 75–91.
[20] Ross B. Girshick, Jeff Donahue, Trevor Darrell, et al. “Rich feature hierarchies for accurate object detection and semantic segmentation”. In: CoRR abs/1311.2524 (2013). arXiv: 1311.2524.
[21] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. “U-net: Convolutional networks for biomedical image segmentation”. In: MICCAI. Springer. 2015, pp. 234–241.
[22] Kaiming He, Xiangyu Zhang, Shaoqing Ren, et al. “Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition”. In: Computer Vision – ECCV 2014. Cham: Springer International Publishing, 2014, pp. 346–361.
[23] J. R. R. Uijlings, K. E. A. van de Sande, T. Gevers, et al. “Selective Search for Object Recognition”. In: International Journal of Computer Vision 104.2 (Sept. 2013), pp. 154–171.
[24] Wei Liu, Dragomir Anguelov, Dumitru Erhan, et al. “SSD: Single Shot MultiBox Detector”. In: Computer Vision – ECCV 2016. Cham: Springer International Publishing, 2016, pp. 21–37.
[25] P. Viola and M. Jones. “Rapid object detection using a boosted cascade of simple features”. In: Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision Vol. 1. 2001, pp. 511–518.
[26] J. Redmon, S. Divvala, R. Girshick, et al. “You Only Look Once: Unified, Real-Time Object Detection”. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). June 2016, pp. 779–788.
[27] Joseph Redmon and Ali Farhadi. “YOLO9000: Better, Faster, Stronger”. In: CoRR abs/1612.08242 (2016). arXiv: 1612.08242.
[28] Fisher Yu and Vladlen Koltun. “Multi-scale context aggregation by dilated convolutions”. In: arXiv preprint arXiv:1511.07122 (2015).
[29] Shuai Zheng, Sadeep Jayasumana, Bernardino Romera-Paredes, et al. “Conditional Random Fields as Recurrent Neural Networks”. In: CoRR abs/1502.03240 (2015). arXiv: 1502.03240.
[30] Alejandro Newell, Kaiyu Yang, and Jia Deng. “Stacked hourglass networks for human pose estimation”. In: European conference on computer vision. Springer. 2016, pp. 483–499.