Font Group Recognition for Improved OCR

(Third Party Funds Single)

Overall project:
Project leader: Vincent Christlein
Project members:
Start date: August 1, 2021
End date: August 1, 2023
Acronym:
Funding source: DFG-Einzelförderung / Sachbeihilfe (EIN-SBH)
URL:

Abstract

Although OCR-D made huge progress in the last project phase in providing OCR for early printed books, it still faces two major problems. First, the huge variety of the material makes it extremely challenging to use generic OCR models. Yet selecting specific models is not possible, as the sheer amount of material prevents a fully automatic workflow. Second, the situation is further complicated by the lack of appropriate OCR training data. Current data sets consist overwhelmingly of texts in Fraktur, especially from the 19th century, which completely neglects the large typographic variety displayed by printing in the three previous centuries. Therefore, and in response to the demand from SLUB Dresden and ULB Halle, we propose to improve the current situation significantly by 1) fine-tuning our font group recognition system to such a degree that it can be used at character level; 2) transcribing more specific OCR training data for the 16th to 18th centuries, covering popular fonts such as Schwabacher, other bastarda types and old Fraktur styles; and 3) training font-specific OCR models as well as integrated models that recognise both typeface and text simultaneously. In other contexts, such joint training has led to better performance on both individual tasks, because it reduces overfitting during training. This project will improve OCR quality significantly, especially for books in non-Fraktur fonts. It will also provide a training data set of very high quality that can be reused in the long term. Finally, the project will provide a more fine-grained font recognition tool that, beyond enabling font-specific OCR, also has important applications in text attribute recognition and layout analysis.

Publications