This thesis explores speech as a direct annotation modality for medical image analysis, bypassing transcription errors and enabling more lightweight models. By training a CLIP-style foundation model on image-speech pairs, we investigate how well speech-based annotations perform compared to text-based ones.
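As a rough illustration of one possible setup (not a fixed design decision), the sketch below pairs an image encoder with a speech encoder and trains both with the symmetric contrastive loss used by CLIP [1]. The encoder choices, projection dimensions, and temperature initialization are placeholder assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeechImageCLIP(nn.Module):
    """Minimal CLIP-style model: image and speech encoders projected into a shared space."""

    def __init__(self, image_encoder, speech_encoder, image_dim, speech_dim, embed_dim=512):
        super().__init__()
        self.image_encoder = image_encoder    # e.g. a ViT backbone (assumption)
        self.speech_encoder = speech_encoder  # e.g. a wav2vec-style backbone (assumption)
        self.image_proj = nn.Linear(image_dim, embed_dim)
        self.speech_proj = nn.Linear(speech_dim, embed_dim)
        # Learnable temperature, initialized to log(1/0.07) as in CLIP.
        self.logit_scale = nn.Parameter(torch.tensor(2.659))

    def forward(self, images, speech):
        img_emb = F.normalize(self.image_proj(self.image_encoder(images)), dim=-1)
        sp_emb = F.normalize(self.speech_proj(self.speech_encoder(speech)), dim=-1)
        return img_emb, sp_emb


def contrastive_loss(img_emb, sp_emb, logit_scale):
    """Symmetric InfoNCE loss over a batch of matched image-speech pairs."""
    logits = logit_scale.exp() * img_emb @ sp_emb.t()           # (B, B) similarity matrix
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```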
Tasks:
- Generate a synthetic speech dataset from a publicly available image-text dataset (a TTS-based sketch follows this list)
- Train a foundation model (CLIP) for annotating medical images using speech annotations
- Evaluate the foundation model on multiple downstream tasks, such as:
  - Zero-shot classification (sketched after this list)
  - Zero-shot segmentation using MedSAM (sketched after this list)
  - Speech grounding (aligning language with corresponding visual elements, e.g. segmentation masks)
- Evaluate the model on a high-quality, real-world dataset from radiologists
- Compare the results to those of an image-text model
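To obtain the synthetic speech dataset, one option is to run a text-to-speech system over the captions or reports of the source image-text dataset. The sketch below uses gTTS purely as an example TTS backend; the CSV column names, file paths, and output layout are assumptions, and any other TTS model could be substituted.

```python
import csv
from pathlib import Path

from gtts import gTTS  # example TTS backend; an offline TTS model would work as well

def synthesize_dataset(metadata_csv: str, out_dir: str) -> None:
    """Read (image, text) rows and write one synthetic audio file per report.

    The CSV columns 'image' and 'text' are assumptions about the source dataset.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with open(metadata_csv, newline="", encoding="utf-8") as f:
        for i, row in enumerate(csv.DictReader(f)):
            audio_path = out / f"{Path(row['image']).stem}.mp3"
            gTTS(text=row["text"], lang="en").save(str(audio_path))
            if i % 100 == 0:
                print(f"synthesized {i} utterances")

# Usage (hypothetical paths):
# synthesize_dataset("image_text_metadata.csv", "speech_annotations/")
```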
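Zero-shot classification with the trained model reduces to embedding each candidate class as a spoken prompt and picking the class whose speech embedding is closest to the image embedding. The sketch assumes the SpeechImageCLIP module sketched earlier and TTS-rendered prompts such as "a chest CT showing <finding>"; both are illustrative, not prescribed by the thesis.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(model, image, class_prompts):
    """Score one image against C spoken class prompts; return the best class index and scores.

    `model` is the SpeechImageCLIP sketch above; `class_prompts` is a (C, ...) batch of
    preprocessed waveforms or speech features, one per candidate class (assumption).
    """
    img_emb = F.normalize(
        model.image_proj(model.image_encoder(image.unsqueeze(0))), dim=-1)   # (1, D)
    sp_emb = F.normalize(
        model.speech_proj(model.speech_encoder(class_prompts)), dim=-1)      # (C, D)
    scores = (img_emb @ sp_emb.t()).squeeze(0)                               # (C,)
    return scores.argmax().item(), scores
```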
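For zero-shot segmentation, MedSAM [3] can be prompted with a bounding box, e.g. around a region proposed by the grounding step. The sketch below assumes a SAM-compatible ViT-B MedSAM checkpoint and the segment-anything predictor interface; the checkpoint filename is a placeholder.

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

def segment_with_box(image_rgb: np.ndarray, box_xyxy: np.ndarray,
                     checkpoint: str = "medsam_vit_b.pth") -> np.ndarray:
    """Zero-shot segmentation from a bounding-box prompt.

    `image_rgb` is an HxWx3 uint8 image; `box_xyxy` is a length-4 array in pixel
    coordinates. The checkpoint path is a placeholder assumption.
    """
    sam = sam_model_registry["vit_b"](checkpoint=checkpoint)
    predictor = SamPredictor(sam)
    predictor.set_image(image_rgb)
    masks, scores, _ = predictor.predict(
        box=box_xyxy[None, :],        # (1, 4) box prompt in xyxy format
        multimask_output=False,
    )
    return masks[0]                   # boolean HxW mask
```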
Related Works:
[1] Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., … & Sutskever, I. (2021). Learning transferable visual models from natural language supervision. In International Conference on Machine Learning (pp. 8748–8763). PMLR.
[2] Hamamci, I. E., Er, S., Almas, F., Simsek, A. G., Esirgun, S. N., Dogan, I., … & Menze, B. (2024). Developing generalist foundation models from a multimodal dataset for 3D computed tomography. arXiv preprint.
[3] Ma, J., He, Y., Li, F., Han, L., You, C., & Wang, B. (2024). Segment anything in medical images. Nature Communications, 15(1), 654.