Master Thesis – Annotation by Speech in Radiology

This thesis explores using speech as a direct annotation modality for medical image analysis, bypassing transcription errors and enabling more lightweight models. By training a CLIP-style foundation model on speech annotations, we investigate how well speech-based supervision performs compared to text-based supervision.

Tasks:

  1. Generate a synthetic speech dataset based on a publicly available image-text dataset
  2. Train a foundation model (CLIP) for annotating medical images using speech annotations
  3. Evaluate the foundation model on multiple downstream tasks, such as:
    • Zero-shot classification
    • Zero-shot segmentation using MedSAM
    • Speech Grounding (align language with corresponding visual elements, e.g. segmentation masks)
  4. Evaluate the model on a real-world, high-quality dataset annotated by radiologists
  5. Compare the results against an image-text model
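Task 2 amounts to CLIP-style contrastive alignment, with the text encoder replaced by a speech encoder. The following is a minimal PyTorch sketch of that idea; the linear projection heads, feature dimensions, and batch setup are illustrative assumptions, not the actual thesis architecture, where the placeholders would be replaced by real image and speech encoders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeechImageCLIP(nn.Module):
    """Minimal CLIP-style model aligning image and speech embeddings.

    NOTE: the two projection heads below are placeholders; in practice
    they would sit on top of pretrained image and speech encoders.
    """

    def __init__(self, image_dim=512, speech_dim=768, embed_dim=256):
        super().__init__()
        self.image_proj = nn.Linear(image_dim, embed_dim)
        self.speech_proj = nn.Linear(speech_dim, embed_dim)
        # Learnable temperature, initialized to ln(1/0.07) as in CLIP [1]
        self.logit_scale = nn.Parameter(torch.tensor(2.659))

    def forward(self, image_feats, speech_feats):
        # L2-normalize both modalities, then compute pairwise similarities
        img = F.normalize(self.image_proj(image_feats), dim=-1)
        sp = F.normalize(self.speech_proj(speech_feats), dim=-1)
        return self.logit_scale.exp() * img @ sp.t()

def clip_loss(logits):
    """Symmetric cross-entropy: matching image/speech pairs lie on the diagonal."""
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

# Toy usage with random features standing in for encoder outputs
model = SpeechImageCLIP()
logits = model(torch.randn(4, 512), torch.randn(4, 768))
loss = clip_loss(logits)
```

Zero-shot classification (Task 3) then reduces to embedding spoken class prompts and picking the class whose speech embedding is most similar to the image embedding.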


Requirements:

  • Experience with PyTorch
  • Hands-on experience with training deep learning models
  • Experience with Natural Language Processing (optional)
  • Experience with using SLURM for job management in a GPU cluster (optional)
  • Deep Learning lecture
  • Pattern Recognition/Analysis lecture (optional)


Application (submissions that do not follow these requirements will not be considered):

  • CV
  • Transcript of Records
  • Short motivation letter (not longer than one page)
  • Email Subject: “Application Speech-CLIP” + your full name

Please send an email with your application documents to lukas.buess@fau.de


Starting Date: 01.01.2025 or later

Related Works:

[1] Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., … & Sutskever, I. (2021, July). Learning transferable visual models from natural language supervision. In International Conference on Machine Learning (pp. 8748-8763). PMLR.

[2] Hamamci, I. E., Er, S., Almas, F., Simsek, A. G., Esirgun, S. N., Dogan, I., … & Menze, B. (2024). Developing Generalist Foundation Models from a Multimodal Dataset for 3D Computed Tomography.

[3] Ma, J., He, Y., Li, F., Han, L., You, C., & Wang, B. (2024). Segment anything in medical images. Nature Communications, 15(1), 654.