Writer Identification using Transformer-based Deep Neural Networks

Type: MA thesis

Status: running

Date: August 1, 2021 - February 1, 2022

Supervisors: Vincent Christlein, Martin Mayr, Andreas Maier

Writer identification is an application of biometric identification based on handwriting. In conventional
machine-learning-based methods, hand-crafted features are extracted to compute a global embedding
of the handwriting images [1]. State-of-the-art deep learning techniques, in which features are learned
automatically by convolutional layers, have shown comparable performance in writer identification
tasks [2][3]. A deep-learning-based writer identification method for historical documents typically
follows this pipeline: First, the regions of a historical document that contain handwriting are selected.
Then, the network extracts local feature descriptors from these regions. Afterwards, the normalized
local descriptors are encoded and aggregated into a global descriptor. Finally, the similarity between
each pair of global descriptors is computed. However, CNN-based methods keep only the parts of the
input captured by the learned filters and fail to encode the spatial relations between these features.
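The pipeline above can be sketched in a few lines of NumPy. Note that this is a minimal illustration only: the random-projection patch descriptor stands in for the learned CNN features, and plain sum pooling stands in for the encoding step, neither of which is the method used in the thesis.

```python
import numpy as np

def local_descriptors(image, patch=32, stride=32, dim=64):
    """Extract local feature descriptors from image patches.

    Placeholder for the learned CNN descriptors: each flattened patch
    is projected with a fixed random matrix (hypothetical stand-in)."""
    rng = np.random.default_rng(0)
    proj = rng.standard_normal((patch * patch, dim))
    descs = []
    for y in range(0, image.shape[0] - patch + 1, stride):
        for x in range(0, image.shape[1] - patch + 1, stride):
            p = image[y:y + patch, x:x + patch].reshape(-1)
            descs.append(p @ proj)
    return np.array(descs)

def global_descriptor(descs):
    """L2-normalize local descriptors, then aggregate them (sum pooling)."""
    descs = descs / (np.linalg.norm(descs, axis=1, keepdims=True) + 1e-12)
    g = descs.sum(axis=0)
    return g / (np.linalg.norm(g) + 1e-12)

def similarity(g1, g2):
    """Cosine similarity between two global descriptors."""
    return float(g1 @ g2)

rng = np.random.default_rng(1)
doc_a = rng.random((128, 128))  # toy "document images"
doc_b = rng.random((128, 128))
ga = global_descriptor(local_descriptors(doc_a))
gb = global_descriptor(local_descriptors(doc_b))
print(similarity(ga, ga))  # ≈ 1.0 for identical documents
print(similarity(ga, gb))
```

Documents by the same writer should yield global descriptors with high cosine similarity; ranking by this similarity then identifies the most likely author.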
Recently, Dosovitskiy et al. [4] proposed the Vision Transformer (ViT) for image classification. Unlike
CNNs, ViT uses self-attention: the image is split into patches, and the model learns the relations
between all patches, so the entire image contributes to each representation. In Dosovitskiy et al.'s
study, ViT attained superior performance while requiring fewer computational resources for training
than state-of-the-art CNNs.
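The core of this idea can be illustrated with a minimal NumPy sketch of ViT's two building blocks, patch tokenization and single-head scaled dot-product self-attention; the patch size, embedding dimension, and random weights below are illustrative assumptions, not the actual model configuration.

```python
import numpy as np

def patchify(image, patch=16):
    """Split an image into non-overlapping flattened patches (ViT tokens)."""
    h, w = image.shape
    patches = [image[y:y + patch, x:x + patch].reshape(-1)
               for y in range(0, h, patch)
               for x in range(0, w, patch)]
    return np.stack(patches)

def self_attention(tokens, wq, wk, wv):
    """Single-head scaled dot-product self-attention over all tokens,
    so every patch can attend to every other patch in the image."""
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    scores = q @ k.T / np.sqrt(k.shape[1])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)  # row-wise softmax
    return weights @ v

rng = np.random.default_rng(0)
img = rng.random((64, 64))
tokens = patchify(img)                                 # 16 tokens of dim 256
d_model = 32
embed = tokens @ rng.standard_normal((256, d_model))   # linear patch embedding
wq, wk, wv = (rng.standard_normal((d_model, d_model)) for _ in range(3))
out = self_attention(embed, wq, wk, wv)
print(out.shape)  # (16, 32): one attended representation per patch
```

Because the attention weights span all token pairs, spatial relations between distant handwriting regions are modeled directly, which is exactly what CNN filters struggle to capture.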
In this work, a new Transformer-based approach will be developed to create the global descriptor of
a historical handwritten document and identify its author. We will compare the new approach with
state-of-the-art methods to investigate the effect of introducing Transformers into writer
identification tasks.
The thesis consists of the following milestones:
• Creating a global embedding of the document image and identifying writers using a Transformer-
based neural network.
• Evaluating performance on the ICDAR17 historical document writer identification competition dataset.
• Evaluating different loss functions.
• Comparing with the Vector of Locally Aggregated Descriptors (VLAD) encoding.
• Experimenting with other network architectures, comparing training speed and performance.
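As one point of reference for the encoding milestone, VLAD aggregates local descriptors by accumulating their residuals to the nearest cluster center of a pre-trained vocabulary. A minimal NumPy sketch (with random toy data in place of real descriptors and a k-means vocabulary):

```python
import numpy as np

def vlad(descs, centers):
    """Vector of Locally Aggregated Descriptors: assign each local
    descriptor to its nearest cluster center, accumulate the residuals
    per center, then power- and L2-normalize the concatenation."""
    # hard assignment to the nearest center
    d2 = ((descs[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    assign = d2.argmin(axis=1)
    k, dim = centers.shape
    v = np.zeros((k, dim))
    for i, c in enumerate(assign):
        v[c] += descs[i] - centers[c]        # residual accumulation
    v = np.sign(v) * np.sqrt(np.abs(v))      # power normalization
    v = v.reshape(-1)
    return v / (np.linalg.norm(v) + 1e-12)   # global L2 normalization

rng = np.random.default_rng(0)
descs = rng.random((100, 64))    # toy local descriptors of one document
centers = rng.random((10, 64))   # stand-in for a k-means vocabulary, k = 10
g = vlad(descs, centers)
print(g.shape)  # (640,): k * dim global descriptor
```

The resulting k * dim vector serves as the global descriptor; the thesis will compare this classical encoding against the Transformer-based aggregation.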
The implementation should be done in Python.