This thesis explores speech as a direct annotation modality for medical image analysis, bypassing transcription errors and enabling more lightweight models. By training a CLIP-like foundation model [1] on paired images and spoken reports, we aim to investigate how well speech-based annotations perform compared to text-based ones.
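As a rough illustration of the intended training setup, the following is a minimal PyTorch sketch of CLIP-style contrastive alignment between images and spoken reports. The encoder modules and their backbones are placeholders (for example, a ViT for images and a wav2vec 2.0-style model for raw audio could be plugged in); this is a sketch under those assumptions, not a prescribed implementation.

```python
# Minimal sketch of CLIP-style contrastive alignment between images and
# spoken reports. The encoder backbones are placeholders (assumption): any
# pretrained image encoder and speech encoder that map each input to a
# single embedding vector would fit this interface.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpeechImageCLIP(nn.Module):
    def __init__(self, image_encoder: nn.Module, speech_encoder: nn.Module):
        super().__init__()
        self.image_encoder = image_encoder    # images    -> (B, D)
        self.speech_encoder = speech_encoder  # waveforms -> (B, D)
        # Learnable temperature, initialised to log(1/0.07) as in CLIP [1].
        self.logit_scale = nn.Parameter(torch.log(torch.tensor(1.0 / 0.07)))

    def forward(self, images: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        img_emb = F.normalize(self.image_encoder(images), dim=-1)
        spc_emb = F.normalize(self.speech_encoder(audio), dim=-1)
        # Scaled pairwise cosine similarities: row i = image i vs. all spoken reports.
        return self.logit_scale.exp() * img_emb @ spc_emb.t()


def clip_loss(logits: torch.Tensor) -> torch.Tensor:
    # Symmetric InfoNCE: the i-th image in the batch matches the i-th spoken report.
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```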
Tasks:
- Generate a synthetic speech dataset based on a publicly available image-text dataset, e.g. by converting the paired report texts to audio with a text-to-speech system (a sketch follows after this list)
- Train a foundation model (CLIP) for annotating medical images using speech annotations
- Evaluate the foundation model on multiple downstream tasks, such as:
  - Zero-shot classification (a sketch follows after this list)
  - Zero-shot segmentation using MedSAM [3]
  - Speech grounding (aligning spoken language with the corresponding visual elements, e.g. segmentation masks)
- Evaluate the model on a real-world, high-quality dataset from radiologists
- Compare the results to an image-text model
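
For the synthetic speech dataset, one straightforward option is to run the report texts of an existing image-text dataset through an off-the-shelf text-to-speech engine. The sketch below uses pyttsx3 and a hypothetical CSV with `image_id` and `report` columns purely for illustration; any (e.g. neural) TTS system could be substituted.

```python
# Sketch: turn the text side of an image-text dataset into spoken reports.
# pyttsx3 and the CSV column names ('image_id', 'report') are assumptions
# for illustration only; any other TTS system could be swapped in.
import csv
from pathlib import Path

import pyttsx3


def synthesize_reports(csv_path: str, out_dir: str) -> None:
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    engine = pyttsx3.init()
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            # One .wav file per image, named after the image identifier.
            engine.save_to_file(row["report"], str(out / f"{row['image_id']}.wav"))
            engine.runAndWait()
```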
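For the zero-shot classification task, the trained model could be queried with one spoken prompt per class (e.g. a synthesized utterance such as "a chest X-ray showing pneumonia"); the predicted class is the one whose speech embedding is most similar to the image embedding. A minimal sketch, reusing the illustrative SpeechImageCLIP interface from above:

```python
# Sketch of zero-shot classification with a speech-image model: each class is
# represented by one spoken prompt, and the class whose speech embedding is
# closest to the image embedding wins. Shapes and the model interface follow
# the illustrative SpeechImageCLIP sketch above.
import torch
import torch.nn.functional as F


@torch.no_grad()
def zero_shot_classify(model, image: torch.Tensor, class_audio: list[torch.Tensor]):
    """image: (1, C, H, W); class_audio: one prompt waveform per class."""
    img_emb = F.normalize(model.image_encoder(image), dim=-1)              # (1, D)
    spc_emb = torch.stack([
        F.normalize(model.speech_encoder(a.unsqueeze(0)), dim=-1).squeeze(0)
        for a in class_audio
    ])                                                                     # (K, D)
    probs = (model.logit_scale.exp() * img_emb @ spc_emb.t()).softmax(dim=-1)
    return probs.argmax(dim=-1).item(), probs  # predicted class index, class probabilities
```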
Requirements:
- Experience with PyTorch
- Hands-on experience with training deep learning models
- Experience with Natural Language Processing (optional)
- Experience with using SLURM for job management in a GPU cluster (optional)
- Deep Learning lecture
- Pattern Recognition/Analysis lecture (optional)
Application (applications that do not follow these requirements will not be considered):
- CV
- Transcript of Records
- Short motivation letter (no more than one page)
- Email Subject: “Application Speech-CLIP” + your full name
Please send an email with your application documents to lukas.buess@fau.de
Starting Date: 01.01.2025 or later
Related Works:
[1] Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., … & Sutskever, I. (2021, July). Learning transferable visual models from natural language supervision. In International conference on machine learning (pp. 8748-8763). PMLR.
[2] Hamamci, I. E., Er, S., Almas, F., Simsek, A. G., Esirgun, S. N., Dogan, I., … & Menze, B. (2024). Developing Generalist Foundation Models from a Multimodal Dataset for 3D Computed Tomography.
[3] Ma, J., He, Y., Li, F., Han, L., You, C., & Wang, B. (2024). Segment anything in medical images. Nature Communications, 15(1), 654.