Improving OCR for Structured Documents using Domain Knowledge

Type: MA thesis

Status: running

Date: September 15, 2024 - March 15, 2025

Supervisors: Mathias Zinnen

Thesis Description

 

The digitization of inventory cards is a recurring issue for museums and university collections. These cards hold structured data organized in card-specific layouts that need to be preserved during digitization. Optical Character Recognition (OCR) can be used for pure text recognition but struggles with structured content: recognition accuracy decreases because textual context is missing, and OCR does not interpret the structured layout.

The goal of this thesis is to build a human-supported layout analysis that enables OCR pipelines to convert inventory cards into structured data. The research investigates whether OCR accuracy can be improved by incorporating prior knowledge about the structure and content of text fields.

 

Mandatory Goals:

  • Design a UI application with the following capabilities:
    • Card layout definition for template matching data fields
    • Detection and correction of minor shifts and rotations
    • Run OCR / image extraction and export to structured data (e.g. CSV)
  • (Semi-)manually annotate a data set for testing and fine-tuning (ca. 100 validation and 500 training samples)
  • Fine-tune one OCR pipeline on the training samples and evaluate it on the validation split (baseline)
  • Re-train the OCR pipeline with additional data type information (int, float, string) and evaluate it against the baseline
  • Additional approach: Baseline OCR with postprocessing steps
    • Rule-based: ensure data consistency with category-specific rules, e.g. normalizing weights to a default unit (“12” -> “12 g”)
    • LLM: query ChatGPT or a lab-internal LLM with the OCR result and the expected output data type, and request a correction of the OCR output
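
The rule-based postprocessing step could look roughly like the following. This is a minimal sketch, not the thesis implementation: the names `DEFAULT_UNITS`, `normalize_measure`, and `coerce` are hypothetical, and the assumption that weights default to grams is taken only from the “12” -> “12 g” example above.

```python
import re

# Hypothetical default units per field category. The real categories and
# units would come from the card layout definition; "g" for weights is
# assumed from the example in the goal list.
DEFAULT_UNITS = {"weight": "g"}

# A number (with "." or "," as decimal separator) and an optional unit.
_NUMBER = re.compile(r"^\s*(\d+(?:[.,]\d+)?)\s*([A-Za-z]*)\s*$")


def normalize_measure(raw: str, category: str) -> str:
    """Append the category's default unit when the OCR text has none."""
    match = _NUMBER.match(raw)
    if match is None:
        # Leave text that does not look like a measure for manual review.
        return raw.strip()
    value, unit = match.groups()
    value = value.replace(",", ".")
    unit = unit or DEFAULT_UNITS.get(category, "")
    return f"{value} {unit}".strip()


def coerce(raw: str, data_type: str):
    """Convert OCR text to the expected data type, or return None on failure."""
    text = raw.strip()
    try:
        if data_type == "int":
            return int(text)
        if data_type == "float":
            return float(text.replace(",", "."))
    except ValueError:
        return None
    # Strings (and unknown types) are passed through unchanged.
    return text
```

A `None` result from `coerce` marks a field whose OCR output contradicts its expected data type, for example "1O" in an integer field. Such fields are natural candidates for the LLM-based correction or for manual review in the UI.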

 

Optional Goals:

  • Add other OCR pipelines for baseline comparison
  • Introduce and train for more specific data types: weight, date, currency, dimensions
  • Test different feature fusion approaches for incorporating the data type information
  • Compare against an LLM-based OCR approach as an additional baseline
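
Comparing pipelines against a baseline requires a shared metric. The description does not name one; the sketch below assumes character error rate (CER), a common choice for OCR evaluation, computed as the total edit distance divided by the total number of reference characters. The function names are illustrative.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def character_error_rate(predictions: list[str], references: list[str]) -> float:
    """Total edit distance divided by total reference length over all fields."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    total_chars = sum(len(ref) for ref in references)
    if total_chars == 0:
        raise ValueError("references contain no characters")
    total_edits = sum(
        edit_distance(pred, ref) for pred, ref in zip(predictions, references)
    )
    return total_edits / total_chars
```

Computing the rate over the same validation split for each pipeline (baseline, type-aware, postprocessed) keeps the comparison consistent. Reporting it per data type as well would show where the type information helps most.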