Image Segmentation via Transformers

Type: MA thesis

Status: finished

Date: May 1, 2021 - November 1, 2021

Supervisors: Nora Gourmelon, Martin Mayr, Vincent Christlein, Andreas Maier

Transformers have recently surged in popularity after outperforming previous state-of-the-art
approaches such as long short-term memory and gated recurrent neural networks in sequence
modelling and transduction problems such as language modelling and machine translation. Transformers
avoid recurrence and instead rely entirely on an attention mechanism to draw global dependencies
between input and output [1]. Transformers are now also being applied to computer vision tasks
such as classification [2], detection [3], and segmentation [4], and as generative adversarial
networks (GANs) [5], by treating image patches as a sequence.
A Transformer architecture has been used successfully for object detection, which made it possible
to drop many hand-designed components, such as non-maximum suppression or anchor generation,
that explicitly encode prior knowledge about the task [3]. Subsequently, it was extended to panoptic
segmentation. Yet Transformers used for segmentation did not rely solely on the sequence potential
of image patches; they typically still used some form of convolutional neural network (CNN) alongside it.
However, Jiang et al. have proposed a pure Transformer-based model in a GAN setting (TransGAN)
for image generation, demonstrating that CNNs can be dropped from GANs [5].
In this work, the idea of using image patches as a sequence input into a Transformer model without
CNNs is carried out for segmentation tasks.
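
To make the idea of treating image patches as a sequence concrete, the following is a minimal
NumPy sketch, not the thesis implementation. It splits an image into non-overlapping patches,
embeds each patch linearly, and applies one single-head self-attention step so that every patch
can attend to every other patch. The function names, the patch size of 8, the embedding width of
64, and the random weights are illustrative assumptions; position embeddings and multi-head
attention are omitted for brevity.

```python
import numpy as np


def image_to_patch_sequence(image, patch_size):
    """Split an (H, W, C) image into a sequence of flattened, non-overlapping patches."""
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    rows, cols = h // patch_size, w // patch_size
    # (rows, p, cols, p, C) -> (rows, cols, p, p, C) -> (rows * cols, p * p * C)
    patches = image.reshape(rows, patch_size, cols, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(rows * cols, patch_size * patch_size * c)


def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a token sequence."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over all patches
    return weights @ v, weights


# Hypothetical sizes and random weights, chosen only for illustration.
rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))
tokens = image_to_patch_sequence(image, patch_size=8)  # 16 patches of length 8 * 8 * 3

d_model = 64
w_embed = rng.normal(scale=0.02, size=(tokens.shape[1], d_model))
embedded = tokens @ w_embed  # linear patch embedding

w_q, w_k, w_v = (rng.normal(scale=0.02, size=(d_model, d_model)) for _ in range(3))
attended, weights = self_attention(embedded, w_q, w_k, w_v)
```

Each row of `weights` is a distribution over all 16 patches, which is how attention relates
distant image regions without convolutions or recurrence.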
The thesis consists of the following milestones:

  • Modifying TransGAN discriminator and generator as encoder and decoder respectively for
  • Evaluating performance on the Cityscapes dataset [6].
  • Further experiments and improvements regarding learning and network architecture.
The implementation should be done in PyTorch Lightning.
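
Segmentation performance on Cityscapes is commonly reported as mean intersection over union
(mIoU). The milestones above do not name a metric, so the following NumPy sketch is an
assumption about how the evaluation could be scored, not the thesis protocol. The function
names are hypothetical, and the ignore label 255 is an assumed convention for pixels that
should not count towards the score. Predictions are assumed to lie in the range
0 to num_classes - 1.

```python
import numpy as np


def confusion_matrix(pred, target, num_classes, ignore_index=255):
    """Count (target, prediction) pairs, skipping pixels labelled ignore_index."""
    mask = target != ignore_index
    idx = target[mask].astype(np.int64) * num_classes + pred[mask].astype(np.int64)
    counts = np.bincount(idx, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)  # rows: target, columns: prediction


def mean_iou(conf):
    """Average IoU = TP / (TP + FP + FN) over classes that occur at all."""
    tp = np.diag(conf).astype(np.float64)
    union = conf.sum(axis=0) + conf.sum(axis=1) - tp
    present = union > 0
    if not present.any():
        return float("nan")
    return float((tp[present] / union[present]).mean())


# Tiny hand-made example with three classes and one ignored pixel.
target = np.array([[0, 0, 1], [1, 2, 255]])
pred = np.array([[0, 1, 1], [1, 2, 2]])
conf = confusion_matrix(pred, target, num_classes=3)
score = mean_iou(conf)
```

In this example the per-class IoU values are 1/2, 2/3, and 1, so `score` is 13/18. In practice
the confusion matrix would be accumulated over the whole validation set before computing the mean.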


[1] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser,
and Illia Polosukhin. Attention is all you need. In Proceedings of the 31st International Conference on
Neural Information Processing Systems, pages 6000–6010, 2017.

[2] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner,
Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16×16
words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.

[3] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey
Zagoruyko. End-to-end object detection with transformers. In European Conference on Computer Vision,
pages 213–229. Springer, 2020.

[4] Sixiao Zheng, Jiachen Lu, Hengshuang Zhao, Xiatian Zhu, Zekun Luo, Yabiao Wang, Yanwei Fu, Jianfeng
Feng, Tao Xiang, Philip HS Torr, et al. Rethinking semantic segmentation from a sequence-to-sequence
perspective with transformers. arXiv preprint arXiv:2012.15840, 2020.

[5] Yifan Jiang, Shiyu Chang, and Zhangyang Wang. TransGAN: Two transformers can make one strong GAN.
arXiv preprint arXiv:2102.07074, 2021.

[6] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson,
Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes dataset for semantic urban scene understanding.
In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.