Automatically generating reports for various medical imaging modalities has attracted considerable attention since the advent of large multimodal models (LMMs) [1]. While the quality of generated reports, specifically in terms of diagnostic accuracy, cannot yet match reports written by expert radiologists, it has been shown that even imperfect reports can serve radiologists as a starting point and improve the efficiency of their workflow [2].
This Master's thesis focuses on generating reports for 3D chest CT scans using the CT-RATE dataset [3], which contains over 25,000 scans with matching anonymized reports written by expert personnel. It will also build on RadGenome-Chest CT [4], which provides a sentence-level segmentation of the reports according to the anatomical regions they reference.
The first part of the thesis focuses on finding and implementing suitable metrics for evaluating the quality of a generated report against reference reports. This remains a field of active research, as currently used metrics do not fully align with human preference. Both traditional metrics based on n-gram overlap, such as BLEU [5], and more recent approaches, such as the GREEN score [6], which is based on a fine-tuned LLM, will be employed.
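To make the n-gram overlap idea concrete, the following is a minimal, self-contained sketch of a BLEU-style score (clipped n-gram precisions combined by a geometric mean with a brevity penalty). It is a simplified illustration, not the evaluation code used in the thesis: it assumes a single reference, pre-tokenized input, and no smoothing for zero counts.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: candidate n-gram counts are capped
    by how often each n-gram occurs in the reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0

def bleu(candidate, reference, max_n=4):
    """Simplified single-reference BLEU: geometric mean of 1..max_n
    precisions times a brevity penalty for short candidates."""
    precisions = [modified_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:  # no smoothing: any zero precision -> 0
        return 0.0
    log_avg = sum(math.log(p) for p in precisions) / max_n
    brevity_penalty = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return brevity_penalty * math.exp(log_avg)
```

For example, an exact match scores 1.0, while a shorter candidate that matches a prefix of the reference is penalized by the brevity penalty. In practice, an established implementation such as sacreBLEU would be preferred over a hand-rolled one.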
The second part will focus on training a report generation model using the architecture of CT-CHAT [3], a vision-language assistant trained on variations of the CT-RATE dataset that can reason about chest CT scans. First, a baseline model will be trained solely on the task of recreating variations of the CT-RATE ground truth reports. Next, the model will be trained to break report generation down into smaller tasks, such as analyzing one anatomical region at a time, inspired by Chain-of-Thought approaches [7], in an attempt to improve report quality.
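The region-by-region decomposition described above can be sketched as follows. This is a hypothetical illustration only: the region list, the `describe_region` stub, and the report layout are assumptions for the sketch, not part of CT-CHAT's actual interface or the RadGenome-Chest CT region taxonomy.

```python
# Example anatomical regions (an assumption for this sketch; the actual
# regions would come from the RadGenome-Chest CT segmentation).
REGIONS = ["lungs", "heart", "mediastinum", "pleura", "bones"]

def describe_region(scan, region):
    """Placeholder for a model call that generates findings for one
    region of the scan. A real system would prompt the vision-language
    model here; this stub just returns a fixed sentence."""
    return f"No abnormality detected in the {region}."

def generate_report(scan):
    """Compose a full report by querying the model one region at a
    time and concatenating the per-region findings."""
    findings = [(region, describe_region(scan, region)) for region in REGIONS]
    return "\n".join(f"{region.capitalize()}: {text}" for region, text in findings)
```

The point of the decomposition is that each model call attends to a single, narrower question, which may reduce omissions compared to generating the whole report in one pass.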
[1] L. Guo, A. M. Tahir, D. Zhang, Z. J. Wang, and R. K. Ward, "Automatic Medical Report Generation: Methods and Applications," SIP, vol. 13, no. 1, 2024, doi: 10.1561/116.20240044.
[2] J. N. Acosta et al., "The Impact of AI Assistance on Radiology Reporting: A Pilot Study Using Simulated AI Draft Reports," Dec. 16, 2024, arXiv: arXiv:2412.12042. doi: 10.48550/arXiv.2412.12042.
[3] I. E. Hamamci et al., "Developing Generalist Foundation Models from a Multimodal Dataset for 3D Computed Tomography," Oct. 16, 2024, arXiv: arXiv:2403.17834. doi: 10.48550/arXiv.2403.17834.
[4] X. Zhang et al., "RadGenome-Chest CT: A Grounded Vision-Language Dataset for Chest CT Analysis," Apr. 25, 2024, arXiv: arXiv:2404.16754. doi: 10.48550/arXiv.2404.16754.
[5] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, "BLEU: a method for automatic evaluation of machine translation," in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL '02), Philadelphia, PA: Association for Computational Linguistics, 2002, pp. 311–318. doi: 10.3115/1073083.1073135.
[6] S. Ostmeier et al., "GREEN: Generative Radiology Report Evaluation and Error Notation," in Findings of the Association for Computational Linguistics: EMNLP 2024, 2024, pp. 374–390. doi: 10.18653/v1/2024.findings-emnlp.21.
[7] J. Wei et al., "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models," in Advances in Neural Information Processing Systems 35 (NeurIPS 2022), 2022. arXiv: arXiv:2201.11903. doi: 10.48550/arXiv.2201.11903.