Invited Talk – Prof. Liangrui Peng (Tsinghua University): Exploring Representation, Attention, and Memory Mechanisms for Text Recognition, Friday, December 19th, 2025, 9 AM CET
It’s a great pleasure to welcome Prof. Liangrui Peng as a speaker in our lab!
Title: Exploring Representation, Attention, and Memory Mechanisms for Text Recognition
Date: Friday, December 19th, 2025, 9 AM CET
Location: https://fau.zoom-x.de/j/61781168039?pwd=PV8PLP5qScT6aGBphxevrPu3OXPucB.1
Abstract:
Text recognition is crucial for many mobile-era applications, including scene text recognition, handwriting recognition and document recognition. The core challenge is to design efficient representation, attention, and memory mechanisms for text recognition. This talk first introduces primitive representation learning for sequence modeling in text recognition. Primitive representations are learned via global feature aggregation and then transformed into high-level visual text representations using a graph convolutional network, enabling parallel decoding for text transcription. A multi-element attention mechanism is further introduced to better exploit spatial and temporal feature information. The proposed method was evaluated on multilingual text recognition tasks and currently holds the top rank on the RRC-MLT 2019 (Task 4) leaderboard.

The talk then investigates how to incorporate memory mechanisms into an LLM-based document recognition framework. Large multimodal models (LMMs) have shown promising performance on various document recognition tasks; however, their implicit modeling leads to parameters that lack interpretability. Inspired by advances in human memory and learning, we propose an explicit multiscale prototype memory that augments document recognition models by explicitly modeling recurrent layout and stylistic patterns across spatial resolutions. A memory retrieval mechanism enables local regions to sparsely attend to a few prototypes (e.g., image borders, tilted text); the retrieved compositional factors are concatenated with visual features and passed to the decoder, providing explicit region-wise structural context. Prototype memory consolidation updates and stabilizes prototypes via an attention-weighted exponential moving average (EMA), while sparsity and anti-collapse regularization promote selective activation and disentanglement. We further adopt hierarchical memory and a scale-adaptive attention module for multi-resolution encoding, trained with a multi-task, entropy-regularized objective. Experiments on the Fox dataset and our self-built DreamDoc dataset demonstrate the effectiveness of the proposed methods.
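For readers who would like a concrete picture of the memory mechanism ahead of the talk, the sketch below (PyTorch-style Python) illustrates, at a purely conceptual level, how sparse prototype retrieval and attention-weighted EMA consolidation could be wired together. The class name, tensor shapes, top-k retrieval, and decay value are illustrative assumptions made for this announcement, not the speaker's actual implementation.

```python
# Hypothetical sketch: prototype memory with sparse retrieval and
# attention-weighted EMA consolidation. All names, shapes, and
# hyperparameters are illustrative assumptions.
import torch
import torch.nn.functional as F


class PrototypeMemory(torch.nn.Module):
    def __init__(self, num_prototypes=64, dim=256, top_k=4, ema_decay=0.99):
        super().__init__()
        # Prototypes live in a non-trainable buffer and are updated via EMA.
        self.register_buffer("prototypes", torch.randn(num_prototypes, dim))
        self.top_k = top_k
        self.ema_decay = ema_decay

    def forward(self, features):
        # features: (batch, num_regions, dim) local visual features.
        # 1) Retrieval: each region sparsely attends to its top-k prototypes.
        scores = features @ self.prototypes.t()                 # (B, R, P)
        topk_scores, topk_idx = scores.topk(self.top_k, dim=-1) # (B, R, K)
        attn = F.softmax(topk_scores, dim=-1)                   # sparse weights
        retrieved = torch.einsum(
            "brk,brkd->brd", attn, self.prototypes[topk_idx]
        )                                                        # (B, R, dim)

        # 2) Retrieved factors are concatenated with the visual features and
        #    would be passed to the decoder as region-wise structural context.
        decoder_input = torch.cat([features, retrieved], dim=-1)

        # 3) Consolidation: attention-weighted EMA update of the prototypes.
        if self.training:
            with torch.no_grad():
                # Scatter the sparse weights back to a dense (B, R, P) map.
                dense_attn = torch.zeros_like(scores).scatter_(-1, topk_idx, attn)
                weights = dense_attn.reshape(-1, dense_attn.shape[-1])  # (B*R, P)
                feats = features.reshape(-1, features.shape[-1])        # (B*R, dim)
                mass = weights.sum(dim=0, keepdim=True).t().clamp(min=1e-6)  # (P, 1)
                target = (weights.t() @ feats) / mass                   # (P, dim)
                updated = self.ema_decay * self.prototypes + (1 - self.ema_decay) * target
                # Only move prototypes that actually received attention mass.
                active = (mass > 1e-5).float()
                self.prototypes.copy_(active * updated + (1 - active) * self.prototypes)

        return decoder_input
```

In this sketch, sparsity comes simply from restricting each region to its top-k prototypes; the anti-collapse and entropy regularization mentioned in the abstract would enter as additional loss terms during training.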
Bio: Liangrui Peng is currently an associate professor at the Department of Electronic Engineering, Tsinghua University, Beijing, China. She received her Ph.D. degree in Information and Communication Engineering from Tsinghua University in 2010. Her research interests include computer vision, machine learning and multilingual text recognition. She has received the National Award for Science and Technology Progress (Second Class) in China three times. Her recent research work with graduate students has advanced multilingual text recognition and received multiple awards, including the DAS 2016 Best Paper Award, the ICDAR 2019 Best Student Paper Runner-Up Award, and the DRR 2015 Best Student Paper Award. Her team also won the ICDAR 2017 and ICPR 2020 competitions on Text Detection and Recognition in Arabic News Videos.

