The recent surge of interest in Transformers began after they outperformed previous state-of-the-art
approaches, such as long short-term memory and gated recurrent neural networks, in sequence
modelling and transduction problems such as language modelling and machine translation. Transformers
avoid recurrence and instead rely entirely on an attention mechanism to draw global dependencies
between input and output [1]. Furthermore, Transformers are now being applied and tested in
computer vision tasks such as classification [2], detection [3] and segmentation [4], as well as in
generative adversarial networks (GANs) [5], by treating image patches as a sequence.
The Transformer architecture was successfully used for object detection, which removed the need for
many hand-designed components, such as non-maximum suppression or anchor generation, that
explicitly encode prior knowledge about the task. It was subsequently extended to panoptic
segmentation. These segmentation models, though, did not rely solely on the sequence potential of
image patches; they typically still used some form of convolutional neural network (CNN) alongside the Transformer.
In contrast, Jiang et al. have proposed a pure Transformer-based GAN (TransGAN) for image
generation, showing that CNNs can be dropped from GANs [5].
This work applies the idea of feeding image patches as an input sequence to a Transformer model,
without CNNs, to segmentation tasks.
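To make the patch-sequence idea concrete, the following is a minimal sketch in plain PyTorch, not the TransGAN architecture and not the model to be developed in the thesis. It splits an image into non-overlapping patches, treats them as a token sequence for a generic Transformer encoder, and folds the per-patch outputs back into a per-pixel class map without any convolution. The names `image_to_patch_sequence`, `patch_sequence_to_image` and `PatchSegmenter`, as well as all hyperparameters (patch size 8, embedding width 64, 19 classes), are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn


def image_to_patch_sequence(images: torch.Tensor, patch_size: int) -> torch.Tensor:
    """Split (B, C, H, W) images into (B, N, C * p * p) flattened patches."""
    b, c, h, w = images.shape
    assert h % patch_size == 0 and w % patch_size == 0, "H and W must be divisible by patch_size"
    gh, gw = h // patch_size, w // patch_size
    # (B, C, gh, p, gw, p) -> (B, gh, gw, C, p, p) -> (B, gh * gw, C * p * p)
    x = images.reshape(b, c, gh, patch_size, gw, patch_size)
    x = x.permute(0, 2, 4, 1, 3, 5)
    return x.reshape(b, gh * gw, c * patch_size * patch_size)


def patch_sequence_to_image(seq: torch.Tensor, channels: int, height: int,
                            width: int, patch_size: int) -> torch.Tensor:
    """Inverse of image_to_patch_sequence: (B, N, C * p * p) -> (B, C, H, W)."""
    b = seq.shape[0]
    gh, gw = height // patch_size, width // patch_size
    x = seq.reshape(b, gh, gw, channels, patch_size, patch_size)
    x = x.permute(0, 3, 1, 4, 2, 5)  # (B, C, gh, p, gw, p)
    return x.reshape(b, channels, height, width)


class PatchSegmenter(nn.Module):
    """Hypothetical convolution-free segmenter: linear patch embedding,
    Transformer encoder, linear head predicting class logits per pixel."""

    def __init__(self, in_channels: int = 3, num_classes: int = 19, patch_size: int = 8,
                 dim: int = 64, depth: int = 2, heads: int = 4, max_patches: int = 1024):
        super().__init__()
        self.patch_size = patch_size
        self.num_classes = num_classes
        self.embed = nn.Linear(in_channels * patch_size ** 2, dim)
        # Learned positional embedding, one vector per patch position (assumed choice).
        self.pos = nn.Parameter(torch.zeros(1, max_patches, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        # Each token predicts the logits of all pixels inside its own patch.
        self.head = nn.Linear(dim, num_classes * patch_size ** 2)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        _, _, h, w = images.shape
        seq = image_to_patch_sequence(images, self.patch_size)
        assert seq.shape[1] <= self.pos.shape[1], "too many patches for positional embedding"
        tokens = self.embed(seq) + self.pos[:, : seq.shape[1]]
        tokens = self.encoder(tokens)
        logits = self.head(tokens)  # (B, N, num_classes * p * p)
        return patch_sequence_to_image(logits, self.num_classes, h, w, self.patch_size)
```

For a batch of shape (2, 3, 16, 32) and patch size 8, the sequence has 8 tokens of length 192, and the model returns logits of shape (2, 19, 16, 32). Full self-attention scales quadratically with the number of patches, so a sketch like this would need a larger patch size or a multi-stage design before being applied to full-resolution images.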
The thesis consists of the following milestones:
- Modifying the TransGAN discriminator and generator as encoder and decoder, respectively, for
segmentation.
- Evaluating performance on the Cityscapes dataset [6].
- Further experiments and improvements regarding learning and network architecture.
The implementation should be done in PyTorch Lightning.
