Concentrating on Text for Improved Document Analysis

Type: MA thesis

Status: finished

Date: April 20, 2020 - October 20, 2020

Supervisors: Vincent Christlein, Tino Haderlein, Andreas Maier

In recent years, deep learning has become a popular research direction and has been applied to text
recognition tasks by many researchers [1]. The variability of text and interference from the background
make text recognition a challenging task; for example, deformed or curved scene text causes various
recognition errors. To improve performance in such cases, Shi et al. [2] proposed RARE
(Robust text recognizer with Automatic Rectification), a recognition model that is robust to irregular
text. Writer identification is the process of identifying the writer of a given text. Convolutional Neural
Networks (CNNs), as a state-of-the-art tool, have been used for writer identification [3][4].
However, the use of CNNs has some limitations: training and running a deep learning model requires
a large amount of computational power [5]. In addition, image-based processes such as scene text
recognition or image-based writer identification can be affected by the background. Typically, less
than 5% of the pixels in a text document are text pixels; the rest is background. A mask that helps to
concentrate on the foreground, i.e. the text pixels, would therefore be beneficial. Some work on object
detection [5] and image inpainting [6] has proposed using a binary mask, where the convolution is
masked and renormalized so that it is conditioned only on valid pixels. Furthermore, a method called
“Self-Attention” has been proposed [7]: the authors combine Self-Attention with GANs to generate
consistent scenarios by leveraging complementary features in distant portions of the image rather than
local regions of fixed shape. The idea of “Self-Attention” could also be investigated for our task.
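The masked and renormalized convolution mentioned above (partial convolution [6]) can be sketched as follows. This is a minimal single-channel NumPy illustration, not the implementation used in the thesis: the function name partial_conv2d, the valid padding, and the stride of 1 are assumptions made for brevity, and the mask is taken to be 1 at text pixels and 0 at background pixels.

```python
import numpy as np

def partial_conv2d(image, mask, kernel, bias=0.0):
    """Single-channel partial convolution (valid padding, stride 1).

    image, mask: 2D arrays of equal shape; mask is 1 for valid (text)
    pixels and 0 for pixels to ignore. As is usual in deep learning,
    the kernel is applied without flipping (cross-correlation).
    Returns the output map and the updated mask.
    """
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    output = np.zeros((out_h, out_w))
    new_mask = np.zeros((out_h, out_w))
    window_size = kh * kw

    for i in range(out_h):
        for j in range(out_w):
            m = mask[i:i + kh, j:j + kw]
            valid = m.sum()
            if valid > 0:
                x = image[i:i + kh, j:j + kw]
                # Use only valid pixels, then rescale by the ratio of
                # window size to the number of valid pixels.
                response = (kernel * x * m).sum()
                output[i, j] = response * (window_size / valid) + bias
                # A window with at least one valid pixel becomes valid.
                new_mask[i, j] = 1.0
            # Otherwise the output and the updated mask stay 0.
    return output, new_mask
```

With a constant image and an averaging kernel, the renormalization keeps the response constant wherever at least one valid pixel falls inside the window, regardless of how many background pixels the window also covers.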
In this work, a binary mask or a self-attention mechanism is incorporated as a matrix into a deep neural
network architecture so that the network focuses on the foreground and ignores the background, while
still allowing end-to-end training for different document analysis tasks, such as writer identification.
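For the self-attention variant, the following NumPy sketch shows how an attention matrix over spatial positions could be computed and applied, loosely following the formulation in [7]. It is a simplification under stated assumptions: the feature map is flattened to one row per position, the projection matrices w_query, w_key, and w_value are hypothetical stand-ins for learned 1x1 convolutions, and the scalar gamma corresponds to the learnable residual weight that [7] initializes to 0.

```python
import numpy as np

def self_attention(features, w_query, w_key, w_value, gamma=0.0):
    """Self-attention over a flattened feature map.

    features: (N, C) array, one row per spatial position.
    w_query, w_key: (C, d) projections; w_value: (C, C) projection.
    Returns the attended features and the (N, N) attention matrix.
    """
    query = features @ w_query
    key = features @ w_key
    value = features @ w_value

    # Pairwise similarity between all positions.
    scores = query @ key.T
    # Row-wise softmax, shifted by the row maximum for stability.
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)

    # Each position aggregates values from every other position.
    attended = weights @ value
    # Residual connection scaled by gamma.
    return gamma * attended + features, weights
```

Each row of the returned attention matrix sums to 1 and indicates how strongly one position draws on every other position; with gamma set to 0 the input features pass through unchanged.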
The thesis consists of the following milestones:
• Incorporate partial convolutions [6] and evaluate them with pre-computed binary masks for
writer identification.
• Evaluate the influence of using a binarization mask to regularize the loss by means of the
Frobenius norm.
• Learn which parts to focus on by using techniques such as self-attention [7] or BABO [5].
• Thoroughly evaluate the different methods and combinations for different document analysis
tasks.
• Conduct further experiments regarding the learning procedure and network architecture.
The implementation should be done in Python.
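The milestone on regularizing the loss with a binarization mask leaves the exact formulation open. One possible reading, sketched below purely as an assumption, penalizes the Frobenius norm of the activation that falls on background pixels. The function name mask_regularized_loss, the argument names, and the default weight of 0.1 are hypothetical and not taken from the thesis.

```python
import numpy as np

def mask_regularized_loss(task_loss, activation_map, text_mask, weight=0.1):
    """Add a Frobenius-norm penalty on background activation.

    activation_map: 2D array of activations or attention values.
    text_mask: 2D binarization mask, 1 for text and 0 for background.
    weight: hypothetical trade-off factor between task loss and penalty.
    """
    # Keep only the activation located on background pixels.
    background = activation_map * (1.0 - text_mask)
    # Frobenius norm: square root of the sum of squared entries.
    penalty = np.linalg.norm(background, ord="fro")
    return task_loss + weight * penalty
```

Under this reading, the penalty vanishes when all activation lies on text pixels and grows as more activation is spent on the background.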

[1] B. Shi, X. Bai, and C. Yao. An end-to-end trainable neural network for image-based sequence recognition and
its application to scene text recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence,
39(11):2298–2304, 2017.
[2] Baoguang Shi, Xinggang Wang, Pengyuan Lyu, Cong Yao, and Xiang Bai. Robust scene text recognition
with automatic rectification. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
June 2016.
[3] Vincent Christlein, Markus Diem, Florian Kleber, Günter Mühlberger, Verena Schwägerl-Melchior, Esther
van Gelder, and Andreas Maier. Automatic Writer Identification in Historical Documents: A Case Study.
Zeitschrift für digitale Geisteswissenschaften, 2016.
[4] Vincent Christlein, David Bernecker, Andreas Maier, and Elli Angelopoulou. Offline Writer Identification
Using Convolutional Neural Network Activation Features. In Juergen Gall, Peter Gehler, and Bastian Leibe,
editors, Pattern Recognition, Lecture Notes in Computer Science, pages 540–552, Berlin, 2015.
[5] Byungseok Roh, Han-Cheol Cho, Myung-Ho Ju, and Soon Hyung Pyo. BABO: Background activation
black-out for efficient object detection, 2020.
[6] Guilin Liu, Fitsum A. Reda, Kevin J. Shih, Ting-Chun Wang, Andrew Tao, and Bryan Catanzaro. Image
inpainting for irregular holes using partial convolutions. In The European Conference on Computer Vision
(ECCV), September 2018.
[7] Han Zhang, Ian Goodfellow, Dimitris Metaxas, and Augustus Odena. Self-attention generative adversarial
networks, 2018.