Improving Weak Localization in Chest X-Ray Vision–Language Models Using Anatomy-Aware and Semi-Supervised Refinement

Type: MA thesis

Status: in progress

Supervisor: Sheethal Bhat

Introduction
Vision–language models (VLMs) trained on paired radiology images and reports have shown strong
performance in chest X-ray analysis, especially in weakly supervised and zero-shot settings [8, 6]. Large-scale
datasets such as MIMIC-CXR enable representation learning without dense annotations [1], while
models such as ALBEF improve cross-modal alignment for downstream tasks [2].
However, localization remains difficult. Weakly supervised methods typically rely on class activation
maps (CAM) and gradient-based variants such as Grad-CAM [9, 4]. These methods often produce coarse
heatmaps that focus on highly discriminative regions rather than the full spatial extent of a pathology.
Their limitations become clear on datasets such as VinDr-CXR, which provide expert bounding box
annotations for localization evaluation [3].
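To make the limitation concrete, the following is a minimal numpy sketch of a CAM-style heatmap: a classifier-weighted sum of convolutional feature maps, rectified and normalized. The function name and toy data are illustrative, not taken from any specific implementation. The example shows how a single highly discriminative peak dominates the map, which is exactly why CAM-style heatmaps tend to under-cover the full extent of a pathology.

```python
import numpy as np

def class_activation_map(features, weights):
    """Minimal CAM-style heatmap (illustrative, not a full Grad-CAM).

    features: (C, H, W) feature maps from the last conv layer.
    weights:  (C,) classifier weights for one pathology class.
    Returns a heatmap scaled to [0, 1].
    """
    cam = np.tensordot(weights, features, axes=([0], [0]))  # weighted sum -> (H, W)
    cam = np.maximum(cam, 0.0)        # keep positive class evidence only
    if cam.max() > 0:
        cam = cam / cam.max()         # normalize to [0, 1]
    return cam

# Toy example: one sharp discriminative peak plus a uniform channel.
feats = np.zeros((2, 4, 4))
feats[0, 1, 1] = 2.0                  # strong localized evidence
feats[1, :, :] = 0.5                  # weak diffuse evidence everywhere
cam = class_activation_map(feats, np.array([1.0, 0.1]))
# After normalization the single peak sits at 1.0 while the rest of the
# map is suppressed, illustrating the coarse, peak-centered behaviour.
```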

Motivation
A key limitation of current weak localization methods is the lack of anatomical constraints. In chest
X-rays, pathologies are spatially structured: for example, Cardiomegaly should align with the heart region,
while Pleural Effusion typically appears in the lower lung zones or pleural regions. Standard heatmap-based
methods do not encode such priors and may therefore produce anatomically implausible activations.
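The simplest way to impose such a prior post hoc is to gate the heatmap with a binary organ mask, suppressing activations outside the anatomically relevant region. The sketch below (hypothetical function name, toy arrays) shows this for a heart mask as it would apply to Cardiomegaly.

```python
import numpy as np

def gate_with_anatomy(heatmap, organ_mask):
    """Suppress heatmap activations outside an anatomical region.

    heatmap:    (H, W) class-specific heatmap in [0, 1].
    organ_mask: (H, W) binary mask of the relevant organ
                (e.g. the heart region for Cardiomegaly).
    """
    gated = heatmap * organ_mask      # zero out anatomically implausible areas
    if gated.max() > 0:
        gated = gated / gated.max()   # renormalize the surviving activations
    return gated

heat = np.array([[0.9, 0.1],
                 [0.4, 0.8]])
mask = np.array([[0, 1],
                 [1, 1]])             # hypothetical heart mask
gated = gate_with_anatomy(heat, mask)
# The strong but implausible activation at (0, 0) is removed entirely.
```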
Recent work shows that anatomical priors can improve localization. Approaches such as [7] integrate
anatomical information directly into the model during training, while [5] introduce anatomy-aware refinement mechanisms. However, these methods rely on architectural modifications or explicit structural modeling.
This thesis investigates a complementary direction. Instead of designing a new anatomy-aware architecture,
anatomical masks are first used to refine class-specific heatmaps post hoc, yielding a supervisory
signal, and then transferred into training so that the model learns anatomically consistent attention
maps implicitly. The central idea is therefore not only to gate heatmaps at inference time, but to learn
the gating behaviour itself during training and remove the need for a separate two-step pipeline.

Related Work
Vision–language pretraining methods such as [8, 6] enable strong classification and zero-shot performance,
but do not explicitly address localization. Weakly supervised localization methods based on CAM and
Grad-CAM remain widely used because of their simplicity, but their spatial precision is limited [9, 4].
In medical imaging, anatomy-aware approaches have been proposed to improve spatial consistency. In
particular, [7] and [5] show that anatomical information can improve weak localization. These works
motivate the present thesis, which focuses on learning anatomy-aware attention in a semi-supervised
manner without relying on major architectural redesign.

Goals of the Thesis
The goal of this thesis is to improve weak localization for one or two chest pathologies, with a primary
focus on Cardiomegaly and potentially Pleural Effusion. The first objective is to evaluate anatomy-aware refinement of class-specific heatmaps generated by a pretrained vision–language model, using anatomical masks to improve spatial plausibility. The second objective is to study an end-to-end setting in which zero-shot classification predictions are used as a gating signal for localization and evaluated on the full test set.
The third and main objective is to incorporate anatomical priors directly into training such that the
model learns to produce anatomically consistent attention maps without requiring explicit gating at
inference time. This will involve using anatomy-aware refinement as a supervisory signal and exploring
semi-supervised strategies that align model attention with anatomical structure.
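One simple way to turn anatomy-aware refinement into a supervisory signal is a loss that penalizes the fraction of attention mass falling outside the organ mask; minimizing it pushes the model toward anatomically consistent attention without any gating at inference time. The formulation below is an illustrative sketch, not necessarily the one the thesis will adopt.

```python
import numpy as np

def anatomy_consistency_loss(attention, organ_mask, eps=1e-8):
    """Fraction of attention mass outside the anatomical mask.

    attention:  (H, W) non-negative attention map.
    organ_mask: (H, W) binary mask of the anatomically plausible region.
    Returns a value in [0, 1]; 0 means all attention lies inside the mask.
    """
    total = attention.sum() + eps                 # eps guards empty maps
    outside = (attention * (1 - organ_mask)).sum()
    return outside / total

attn = np.array([[0.2, 0.0],
                 [0.3, 0.5]])
mask = np.array([[0, 0],
                 [1, 1]])
loss = anatomy_consistency_loss(attn, mask)
# 0.2 of 1.0 total attention mass lies outside the mask.
```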
Overall, the thesis aims to develop a simple and interpretable approach for improving weak localization
while maintaining minimal annotation requirements.