Learning Human-Aligned Evaluation Metrics for Radiology Reports

Type: MA thesis

Status: open

Supervisors: Lukas Buess, Andreas Maier

This master's thesis focuses on developing a human-aligned evaluation metric for radiology report generation. Using large-scale medical datasets, you will train and analyze language models that assess report quality in a way that better reflects human and clinical preferences.

Tasks:

  • Dataset preparation
  • LLM finetuning
  • Comprehensive evaluation

Requirements:

  • Experience with PyTorch and training models
  • Experience with vision or language models
  • (Optional) Experience using SLURM
  • (Recommended) Deep Learning / Pattern Recognition lecture

Application: (Applications that do not follow the application requirements will not be considered)
Please send your CV, transcript of records, and a short motivation letter (1 page max) with the subject “Application ReportMetric + your_full_name” to Lukas.Buess@fau.de

Start Date: 15.01.2026 or later

Relevant Literature:
[1] Johnson, A. E., Pollard, T. J., Berkowitz, S. J., Greenbaum, N. R., Lungren, M. P., Deng, C. Y., … & Horng, S. (2019). MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Scientific Data, 6(1), 317.
[2] Pellegrini, C., Özsoy, E., Busam, B., Navab, N., & Keicher, M. (2023). RaDialog: A large vision-language model for radiology report generation and conversational assistance. arXiv preprint arXiv:2311.18681.
[3] Hamamci, I. E., Er, S., Wang, C., Almas, F., Simsek, A. G., Esirgun, S. N., … & Menze, B. (2024). Developing generalist foundation models from a multimodal dataset for 3d computed tomography. arXiv preprint arXiv:2403.17834.
[4] Blankemeier, L., Cohen, J. P., Kumar, A., Van Veen, D., Gardezi, S. J. S., Paschali, M., … & Chaudhari, A. S. (2024). Merlin: A vision language foundation model for 3d computed tomography. Research Square, rs-3.
[5] Ostmeier, S., Xu, J., Chen, Z., Varma, M., Blankemeier, L., Bluethgen, C., … & Delbrouck, J. B. (2024). GREEN: Generative radiology report evaluation and error notation. arXiv preprint arXiv:2405.03595.
[6] Xu, J., Zhang, X., Abderezaei, J., Bauml, J., Boodoo, R., Haghighi, F., … & Delbrouck, J. B. (2025, November). RadEval: A framework for radiology text evaluation. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations (pp. 546-557).