As of recent, transformer models have started to outperform the classic deep convolutional neural networks in many classic computer vision tasks. These transformer models consist of multi-headed self-attention layers followed by linear layers. The former layer soft-routes value information based on three matrix embeddings: query, key and value. The inner product of query and key are input into a softmax function for normalization and the resulting similarity matrix is multiplied with the value embedding. Multi-headed self-attention creates multiple sets of query, key and value matrices that are independently computed, then concatenated and projected into the original embedding dimension. Visual transformers excel in their ability to incorporate non-local information into their latent representation, allowing for better results when classification relevant information is scattered across the entire image.
The downside of pure attention models like ViT , which treat image-patches as sequence-tokens, is the requirement of lots of samples to make up for their lack of inductive priors. This makes them unsuitable for low-data regimes like historical document analysis. Further, the computation of the similarity matrix leads to a matrix quadratic in input length, complicating high-resolution computations.
One solution promising to alleviate the data hunger of transformers while still profiting from their global representation ability, is the usage of hybrid methods that combine CNN and self-attention layers. Those models jointly train a network comprised of a number of convolutional layers to preprocess and downsample inputs, followed by a form of multi-headed self-attention.  differentiates hybrid self-attention models into “transformer blocks” and “non-local blocks”, the latter of which is equivalent to single-headed self-attention sans the lack of value embeddings and positional encodings.
The objective of this thesis is the classification of script type, date and location of historical documents, using a single multi-headed hybrid self-attention model.
The thesis consists of the following milestones:
- Construction of hybrid models for classification
- Benchmarking on the ICDAR 2021 competition dataset
- Further architectural analyses of hybrid self-attention models
 Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Un-terthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and NeilHoulsby. An image is worth 16×16 words: Transformers for image recognition at scale. InInternationalConference on Learning Representations, 2021.
 Aravind Srinivas, Tsung-Yi Lin, Niki Parmar, Jonathon Shlens, Pieter Abbeel, and Ashish Vaswani. Bottleneck transformers for visual recognition, 2021.