End-to-end Deep Learning based Writer Identification

Writer identification is the task of determining who wrote a given document. State-of-the-art
writer identification systems based on deep convolutional neural networks (CNNs) consist of three
components [1] [2]. First, local image patches are extracted at the keypoints detected by the
Scale-Invariant Feature Transform (SIFT). Subsequently, local descriptors for these patches are
computed from the penultimate layer of a deep CNN. Finally, the local descriptors are embedded and
aggregated into a global descriptor that is used for comparison. For example, Keglevic et al. [1] map
image patches into a global descriptor using a triplet network followed by VLAD encoding.
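The aggregation step can be illustrated with a minimal VLAD sketch. It assumes NumPy, a codebook that has already been computed (for example by k-means), and power plus L2 normalization as the post-processing; the function name, the normalization choice, and the random stand-in data are illustrative assumptions and are not taken from [1] or [2].

```python
import numpy as np

def vlad_encode(descriptors, centers):
    """Aggregate local descriptors (N x D) into one VLAD vector of length K*D."""
    # Hard-assign each descriptor to its nearest cluster center.
    dists = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    nearest = dists.argmin(axis=1)
    k, d = centers.shape
    vlad = np.zeros((k, d))
    for i in range(k):
        members = descriptors[nearest == i]
        if len(members):
            # Sum of residuals between the descriptors and their assigned center.
            vlad[i] = (members - centers[i]).sum(axis=0)
    vlad = vlad.ravel()
    # Power normalization followed by L2 normalization (an assumed, common choice).
    vlad = np.sign(vlad) * np.sqrt(np.abs(vlad))
    norm = np.linalg.norm(vlad)
    return vlad / norm if norm > 0 else vlad

rng = np.random.default_rng(0)
local_descriptors = rng.normal(size=(200, 8))  # stand-in for CNN activations
codebook = rng.normal(size=(4, 8))             # stand-in for k-means centers
global_descriptor = vlad_encode(local_descriptors, codebook)
```

Two documents would then be compared through the distance between their global descriptors, for example the cosine distance.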
However, these systems consist of three separate algorithmic components, so patch extraction and
feature encoding cannot be optimized in an end-to-end fashion. Ren et al. [3] introduce a Region
Proposal Network (RPN) that generates high-quality region proposals with a deep network. Because the
RPN shares convolutional layers with the object detection network, it adds almost no extra
computation. Moreover, Zhang et al. [4] generalize VLAD encoding and propose the Deep Texture
Encoding Network (Deep-TEN), whose learnable encoding layer performs supervised feature
aggregation. Instead of average pooling, Christlein et al. [5] propose Deep Generalized Max Pooling
(DGMP), which computes weights for the local activation vectors in order to balance frequent and
rare embeddings that consist of locally coherent activations.
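To make the relation between VLAD and a learnable encoding layer concrete, the following is a minimal forward-pass sketch of residual encoding with soft assignment. It assumes NumPy and treats the codewords and smoothing factors as given arrays; in a trainable layer they would be parameters updated by backpropagation. The function and variable names are illustrative and do not reproduce the implementation of [4].

```python
import numpy as np

def encoding_layer_forward(x, codewords, smoothing):
    """Soft-assignment residual encoding.

    x:         (N, D) local descriptors
    codewords: (K, D) codebook entries (learnable in a real layer)
    smoothing: (K,)   non-negative smoothing factors (learnable in a real layer)
    Returns a (K, D) array of aggregated residuals.
    """
    residuals = x[:, None, :] - codewords[None, :, :]      # (N, K, D)
    sq_dist = (residuals ** 2).sum(axis=2)                 # (N, K)
    logits = -smoothing[None, :] * sq_dist
    logits = logits - logits.max(axis=1, keepdims=True)    # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=1, keepdims=True)  # softmax over codewords
    # Weighted sum of residuals per codeword.
    return (weights[:, :, None] * residuals).sum(axis=0)

x = np.array([[1.0, 0.0], [3.0, 0.0]])
codewords = np.array([[0.0, 0.0], [2.0, 0.0]])
encoded = encoding_layer_forward(x, codewords, smoothing=np.array([1.0, 1.0]))
```

With very large smoothing factors the soft assignment approaches the hard nearest-center assignment of VLAD, which is one way to see the layer as a generalization of it.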
In this work, the current keypoint-based patch extraction is replaced by an RPN. The VLAD encoding
is replaced by a Deep-TEN layer, and the current average pooling is replaced by DGMP. Ultimately, an
end-to-end trainable network for writer identification should be established.
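The difference between average pooling and generalized max pooling can be sketched on a toy example. The code below assumes NumPy and uses the ridge-regression formulation of generalized max pooling in its dual form; the function name, the regularization parameter lam, and the toy activations are illustrative assumptions, and the sketch is not the DGMP layer of [5] itself.

```python
import numpy as np

def generalized_max_pool(phi, lam=1.0):
    """Pool local activation vectors phi (N x D) into a single D-dimensional vector.

    One weight per location is computed so that the pooled vector has an
    approximately equal dot product with every local vector.
    """
    n = phi.shape[0]
    gram = phi @ phi.T                                  # (N, N) similarity matrix
    alpha = np.linalg.solve(gram + lam * np.eye(n), np.ones(n))
    return phi.T @ alpha                                # weighted sum of local vectors

# Three identical "frequent" activations and one "rare" activation.
phi = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
average = phi.mean(axis=0)                        # dominated by the frequent pattern
balanced = generalized_max_pool(phi, lam=1e-6)    # both patterns contribute similarly
```

In this toy case the average is [0.75, 0.25], while the generalized max pooled vector is close to [1, 1], so the rare activation is no longer suppressed.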
The thesis consists of the following milestones:
• Incorporating an RPN, a Deep-TEN layer, and DGMP into a neural network.
• Evaluating the performance on the ICDAR17 competition dataset on historical document writer
identification [6].
• Comparing each stage against its non-deep-learning counterpart.
• Conducting further experiments on the learning procedure and the network architecture.
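The evaluation milestone could be sketched as follows. The code assumes NumPy, one global descriptor per document, cosine similarity for ranking, and a leave-one-out retrieval protocol scored by top-1 accuracy and mean average precision; these metric and protocol choices are assumptions made for illustration and should be checked against [6].

```python
import numpy as np

def evaluate_retrieval(descriptors, writer_ids):
    """Leave-one-out retrieval; returns (top1_accuracy, mean_average_precision)."""
    x = descriptors / np.linalg.norm(descriptors, axis=1, keepdims=True)
    similarity = x @ x.T
    writer_ids = np.asarray(writer_ids)
    top1_hits, average_precisions = [], []
    for q in range(len(x)):
        # Rank all other documents by decreasing cosine similarity to the query.
        order = np.argsort(-similarity[q])
        order = order[order != q]
        relevant = writer_ids[order] == writer_ids[q]
        if not relevant.any():
            continue  # skip queries with no other document by the same writer
        top1_hits.append(relevant[0])
        ranks = np.flatnonzero(relevant) + 1             # 1-based ranks of the hits
        precisions = np.arange(1, len(ranks) + 1) / ranks
        average_precisions.append(precisions.mean())
    return float(np.mean(top1_hits)), float(np.mean(average_precisions))

docs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
top1, mean_ap = evaluate_retrieval(docs, writer_ids=[0, 0, 1, 1])
```

Running the same function on descriptors from the keypoint-based pipeline and from the end-to-end network would give directly comparable numbers for each stage.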
The implementation should be done in Python.

[1] M. Keglevic, S. Fiel, and R. Sablatnig. Learning features for writer retrieval and identification using triplet
CNNs. In 2018 16th International Conference on Frontiers in Handwriting Recognition (ICFHR), pages
211–216, Aug. 2018.
[2] V. Christlein, M. Gropp, S. Fiel, and A. K. Maier. Unsupervised feature learning for writer identification
and writer retrieval. In 2017 14th IAPR International Conference on Document Analysis and Recognition
(ICDAR), volume 01, pages 991–997, 2017.
[3] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region
proposal networks. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors,
Advances in Neural Information Processing Systems 28, pages 91–99. Curran Associates, Inc., 2015.
[4] H. Zhang, J. Xue, and K. Dana. Deep TEN: Texture encoding network. In The IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), July 2017.
[5] V. Christlein, L. Spranger, M. Seuret, A. Nicolaou, P. Král, and A. Maier. Deep generalized max pooling.
2019.
[6] S. Fiel, F. Kleber, M. Diem, V. Christlein, G. Louloudis, S. Nikos, and B. Gatos. ICDAR2017 competition on
historical document writer identification (Historical-WI). In 2017 14th IAPR International Conference on
Document Analysis and Recognition (ICDAR), volume 01, pages 1377–1382, Nov. 2018.