Index
Denoising and Inpainting of 3D OCT images using Deep Learning
As a non-invasive 3D optical imaging modality that operates at the micrometer scale, Optical Coherence Tomography (OCT) has become a standard of care in ophthalmology [1].
However, OCT imaging is inherently noisy; two typical noise sources are detection noise and laser speckle [2], [3]. Several approaches to image enhancement exist. Because clean ground-truth data is rarely available, deep learning approaches are often unsupervised. Noise2Noise [4], for example, learns a denoising operation without needing clean versions of the samples during training. Instead, it relies on assumptions about the statistical nature of noise compared to the actual signal [4], [3]. Deep learning has already been applied to OCT data in [3]. That work is optimized primarily for a low-latency scenario and employs an unsupervised blind-spot denoising network trained on a masked version of the original data.
A more complex approach to generating high-quality data is volume fusion. Volume fusion is a three-step process comprising motion correction of multiple OCT images, e.g., [6], illumination correction of brightness artifacts, e.g., [7], and merging of the resulting data. Results in [5] demonstrate signal enhancement and improved visibility of subtle retinal features on a micrometer scale. However, the authors of [5] suggest using around 4–6 volumes for clean results. While fewer volumes would be preferable for efficient clinical screening, using only two volumes can leave gaps in the resulting image. These gaps result from eye motion during the OCT scanning process. Thus, it would be desirable to improve the results obtained from fewer scans so that they reach an image quality similar to that achieved with more volumes.
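To illustrate why fusing only two scans can leave gaps, the following toy sketch averages co-registered samples wherever at least one scan provides valid data. It is a simplification under stated assumptions, not the fusion method of [5], [6] or [7]: each scan is a flat list of samples, `None` marks positions lost to eye motion, and the function name `fuse_scans` is hypothetical.

```python
def fuse_scans(scans):
    """Average co-registered samples; keep None where no scan has data.

    `scans` is a list of equally long lists. A value of None marks a
    position that the scan did not cover (for example due to eye motion).
    """
    fused = []
    for samples in zip(*scans):
        valid = [s for s in samples if s is not None]
        # A gap remains only if every scan misses this position.
        fused.append(sum(valid) / len(valid) if valid else None)
    return fused


# Two toy scans with different missing regions (values are made up).
scan_a = [1.0, None, 3.0, None]
scan_b = [3.0, 2.0, None, None]
fused = fuse_scans([scan_a, scan_b])
gap_positions = [i for i, v in enumerate(fused) if v is None]
```

With more scans, the chance that every scan misses the same position shrinks, which matches the observation that gaps mainly appear when very few volumes are fused.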
The goal of this master’s thesis is to develop a method for denoising and for inpainting gaps in motion-corrected 3D OCT images using supervised deep learning. We aim to improve the quality of images fused from fewer scans by training a denoiser on high-quality volumes that were motion-corrected and fused from multiple scans using [6] and [7]; these volumes serve as ground truth for training.
The results will then be evaluated against the ground truth. Possible metrics include the structural similarity, the peak signal-to-noise ratio, and the contrast-to-noise ratio. Additionally, the correctness of inpainting will be evaluated by comparing the result to additional co-registered data that was not available to the image enhancement method.
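As a rough illustration of two of these metrics, the sketch below computes the peak signal-to-noise ratio and one common variant of the contrast-to-noise ratio on flat lists of intensities. The function names, the `data_range` parameter, and the choice of CNR definition (absolute mean difference divided by the background standard deviation) are assumptions made for this example; the thesis may settle on different definitions.

```python
import math


def psnr(reference, estimate, data_range=1.0):
    """Peak signal-to-noise ratio in dB between two equally long sequences."""
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(data_range ** 2 / mse)


def cnr(foreground, background):
    """Contrast-to-noise ratio: |mean_fg - mean_bg| / std_bg (one common variant)."""
    mean_fg = sum(foreground) / len(foreground)
    mean_bg = sum(background) / len(background)
    var_bg = sum((b - mean_bg) ** 2 for b in background) / len(background)
    return abs(mean_fg - mean_bg) / math.sqrt(var_bg)


# Toy values, not real OCT data.
psnr_value = psnr([0.0, 1.0], [0.1, 0.9])
cnr_value = cnr([4.0, 4.0], [0.0, 2.0])
```

For 3D volumes, the same formulas apply to the flattened voxel intensities; the structural similarity is more involved and is usually taken from an existing library.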
In addition, this master’s thesis has the following requirements:
– literature research
– assembling of training and test sets with healthy data as well as data with different pathologies
– implementation of the method using a common deep learning framework
– submission of the method and the evaluation code
– description of the performed work in a written thesis according to the lab’s thesis guidelines
– introductory and final presentation
References:
[1] Fujimoto J, Swanson E. “The Development, Commercialization, and Impact of Optical Coherence
Tomography.” In: Invest Ophthalmol Vis Sci. 2016 Jul 1;57(9):OCT1-OCT13, doi: 10.1167/iovs.16-19963.
PMID: 27409459; PMCID: PMC4968928.
[2] DuBose, Theodore B., et al. “Statistical models of signal and noise and fundamental limits of
segmentation accuracy in retinal optical coherence tomography.” In: IEEE Transactions on Medical
Imaging, 2017, vol. 37, no. 9, pp. 1978-1988.
[3] Nienhaus, J., Matten, P., Britten, A. et al. “Live 4D-OCT denoising with self-supervised deep learning.”
In: Sci Rep 13, 5760 (2023), doi: 10.1038/s41598-023-32695-1
[4] Lehtinen, J., et al. “Noise2Noise: Learning Image Restoration without Clean Data.” arXiv preprint
arXiv:1803.04189, 2018.
[5] Won, Jungeun, et al. “Topographic Measurement of the Subretinal Pigment Epithelium Space in
Normal Aging and Age-Related Macular Degeneration Using High-Resolution OCT.” In: Investigative
Ophthalmology & Visual Science, 2024, vol. 65, no. 10, p. 18.
[6] Ploner, Stefan, et al. “A spatiotemporal model for precise and efficient fully-automatic 3d motion
correction in oct.” In: International Conference on Medical Image Computing and Computer-Assisted
Intervention. Cham: Springer Nature Switzerland, 2022. pp. 517-527, doi: 10.1007/978-3-031-16434-7_50
[7] Ploner, Stefan, et al. “A spatiotemporal illumination model for 3d image fusion in optical coherence
tomography.” In: 2023 IEEE 20th International Symposium on Biomedical Imaging (ISBI). IEEE, 2023. pp. 1-5,
doi: 10.1109/ISBI53787.2023.10230526.
Exploring the Neural Representation of Language: A Comparative Analysis of MEG Brain Activity and Word Embeddings
Enhancing Lithium-Ion Batteries Safety
Generation of Region-guided Clinical Text Reports from Chest X-Ray Images Using LLMs
Foundation Models for Glacier Segmentation
This thesis aims to evaluate three state-of-the-art foundation models for the task of semantic segmentation,
specifically targeting the segmentation of glacier calving fronts in synthetic aperture radar (SAR) imagery. Foundation
models are recognized as general-purpose, task-agnostic models that are pre-trained on extensive
datasets, allowing them to be adapted for specific tasks with minimal additional training [1][2][3]. This
research will explore the efficacy of these models when applied to SAR data, which presents unique
challenges due to its complex imaging characteristics. The models were selected based
on their reported performance, methodologies, and the datasets used. To assess the suitability of the learned
features for our CaFFe [4] dataset, the models shall be implemented in PyTorch and compared
quantitatively and qualitatively with each other. This involves comparing fine-tuning the decoders for calving
front delineation against fine-tuning only the classifier head on top of frozen backbone features.
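The two fine-tuning regimes can be sketched in PyTorch as follows. This is a minimal illustration under stated assumptions: `TinySegmenter` is a hypothetical stand-in for a pre-trained foundation model with a decoder and a classifier head, and the layer sizes, the number of classes, and the learning rate are placeholders rather than values from the thesis.

```python
import torch
from torch import nn


class TinySegmenter(nn.Module):
    """Hypothetical stand-in: pre-trained backbone, decoder, classifier head."""

    def __init__(self, num_classes: int = 4):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(8, 16, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.Conv2d(16, 16, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.head = nn.Conv2d(16, num_classes, kernel_size=1)

    def forward(self, x):
        return self.head(self.decoder(self.backbone(x)))


def configure_finetuning(model, train_decoder):
    """Freeze the backbone; train the head and, optionally, the decoder."""
    for p in model.backbone.parameters():
        p.requires_grad = False
    for p in model.decoder.parameters():
        p.requires_grad = train_decoder
    for p in model.head.parameters():
        p.requires_grad = True
    return [p for p in model.parameters() if p.requires_grad]


def count(params):
    return sum(p.numel() for p in params)


model = TinySegmenter()
head_only = count(configure_finetuning(model, train_decoder=False))
decoder_and_head = count(configure_finetuning(model, train_decoder=True))

# Only the trainable parameters are handed to the optimizer.
optimizer = torch.optim.AdamW(configure_finetuning(model, train_decoder=True), lr=1e-4)
logits = model(torch.randn(2, 1, 32, 32))  # single-channel input, as for SAR intensity
```

Comparing `head_only` with `decoder_and_head` shows how much smaller the trainable parameter set becomes when only the classifier head is adapted.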
• Foundation model 1, DINOv2 [1]: DINOv2, developed by Meta AI, represents a significant
advancement in self-supervised learning for computer vision applications. This model employs
a transformer-based architecture and utilizes a teacher-student training paradigm to facilitate
learning general-purpose visual features without needing labeled data. A critical aspect of
DINOv2 is its emphasis on scaling both the model and the dataset. Unlike previous foundation
models, DINOv2 maintains strict control over data quality and diversity, essential for producing
effective visual representations. For evaluation purposes, we focus on the CaFFe dataset and
assess at least one of the reported models trained on the ADE20K [5] and Pascal VOC 2012 [6] datasets.
• Foundation model 2, Prithvi [2]: Prithvi, developed by IBM and NASA, is a pioneering
foundation model specifically tailored for geospatial data. This model has been tested
across a variety of Earth observation tasks. It uses a masked autoencoder technique with a Vision
Transformer architecture. Prithvi leverages multispectral satellite imagery from the Harmonized
Landsat Sentinel-2 (HLS) dataset, which offers high-resolution data suitable for diverse ecological
analyses. The model incorporates statistical factors such as precipitation and temperature,
minimizing bias towards specific landscapes and reducing redundancy across different regions
and time periods. For evaluation, this study will utilize the CaFFe dataset and assess at least one
of the three pre-trained models focused on flood mapping[7], wildfire scar mapping[8], and crop
segmentation[9].
• Foundation model 3, SMLFR [3]: The SMLFR model is a generative convolutional neural network
designed for analyzing remote sensing data. Like Prithvi, SMLFR uses a masked autoencoder
technique, but it is built on a convolutional architecture called ConvNeXt [10], which is an
updated version of traditional ConvNets inspired by transformers and competes well with
transformers regarding accuracy and scalability. In addition, it improves feature representation
during training by applying high-frequency filtering to images. The SMLFR model is trained on
a geographical dataset collected from various sensors, including Sentinel-2, Gaofen, Landsat,
and QuickBird, and contains images from different continents and environments. This study will
evaluate the model on the CaFFe dataset using at least one of the two pre-trained models trained
on the Potsdam [11] and LoveDA [12] datasets.
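Prithvi and SMLFR both rely on masked autoencoding, in which part of the input is hidden and the network learns to reconstruct it. The masking step alone can be sketched as below. The function name, the patch grid size, and the mask ratio of 0.75 are assumptions chosen for illustration and are not taken from [2] or [3].

```python
import random


def random_patch_mask(grid_h, grid_w, mask_ratio=0.75, seed=0):
    """Return a grid of booleans; True marks a patch hidden from the encoder."""
    rng = random.Random(seed)
    num_patches = grid_h * grid_w
    num_masked = int(round(num_patches * mask_ratio))
    masked = set(rng.sample(range(num_patches), num_masked))
    return [
        [(row * grid_w + col) in masked for col in range(grid_w)]
        for row in range(grid_h)
    ]


# A 4x4 patch grid: the encoder would only see the patches marked False.
mask = random_patch_mask(4, 4)
num_hidden = sum(cell for row in mask for cell in row)
```

During pre-training, the reconstruction loss is typically computed on the hidden patches, which forces the model to learn features that capture spatial context.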
References
[1] Maxime Oquab et al. “DINOv2: Learning Robust Visual Features without Supervision”. In:
(2024). arXiv: 2304.07193 [cs.CV]. URL: https://arxiv.org/abs/2304.07193.
[2] Johannes Jakubik et al. “Foundation Models for Generalist Geospatial Artificial Intelligence”.
In: (2023). arXiv: 2310.18660 [cs.CV]. URL: https://arxiv.org/abs/2310.18660.
[3] Zhe Dong, Yanfeng Gu, and Tianzhu Liu. “Generative ConvNet Foundation Model With Sparse
Modeling and Low-Frequency Reconstruction for Remote Sensing Image Interpretation”. In:
IEEE Transactions on Geoscience and Remote Sensing 62 (2024), pp. 1–16. DOI: 10.1109/TGRS.2023.3348479.
[4] N. Gourmelon et al. “Calving fronts and where to find them: a benchmark dataset and methodology
for automatic glacier calving front extraction from synthetic aperture radar imagery”. In:
Earth System Science Data 14.9 (2022), pp. 4287–4313. DOI: 10.5194/essd-14-4287-2022.
URL: https://essd.copernicus.org/articles/14/4287/2022/.
[5] Bolei Zhou et al. “Scene Parsing through ADE20K Dataset”. In: 2017 IEEE Conference on Computer
Vision and Pattern Recognition (CVPR). 2017, pp. 5122–5130. DOI: 10.1109/CVPR.2017.544.
[6] Mark Everingham et al. “The pascal visual object classes (VOC) challenge”. In: Int. J.
Comput. Vis. 88.2 (June 2010), pp. 303–338.
[7] Derrick Bonafilia et al. “Sen1Floods11: a georeferenced dataset to train and test deep learning
flood algorithms for Sentinel-1”. In: 2020 IEEE/CVF Conference on Computer Vision and
Pattern Recognition Workshops (CVPRW). 2020, pp. 835–845. DOI: 10.1109/CVPRW50498.2020.00113.
[8] NASA IBM. Wildfire Scar Mapping Dataset. URL: https://huggingface.co/datasets/ibm-nasa-geospatial/hls%20burn%20scars.
[9] NASA IBM. Multi-Temporal Crop Segmentation. URL: https://huggingface.co/datasets/ibm-nasa-geospatial/multi-temporal-crop-classification.
[10] Zhuang Liu et al. A ConvNet for the 2020s. 2022. arXiv: 2201.03545 [cs.CV]. URL:
https://arxiv.org/abs/2201.03545.
[11] BSF Swissphoto. 2D Semantic Labeling Contest – Potsdam. URL: http://web.archive.org/web/20080207010024/http://www.808multimedia.com/winnt/kernel.htm.
[12] Junjue Wang et al. LoveDA: A Remote Sensing Land-Cover Dataset for Domain Adaptive
Semantic Segmentation. 2022. arXiv: 2110.08733 [cs.CV]. URL: https://arxiv.org/abs/2110.08733.
Stammering Identification using Large Language Models
Master Thesis – Annotation by Speech in Radiology
This thesis explores using speech as a direct annotation modality for medical image analysis, bypassing transcription errors and enabling more lightweight models. By training a foundation model like CLIP, we aim to investigate how well speech-based annotations perform compared to text.
Tasks:
- Generate a synthetic speech dataset based on a publicly available image-text dataset
- Train a foundation model (CLIP) for annotating medical images using speech annotations
- Evaluate the foundation model on multiple downstream tasks, such as:
  - Zero-shot classification
  - Zero-shot segmentation using MedSAM
  - Speech grounding (aligning language with corresponding visual elements, e.g. segmentation masks)
- Evaluate the model on a real-world high-quality dataset from radiologists
- Compare the results to an image-text model
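The zero-shot classification task can be illustrated with a small sketch of CLIP-style scoring, in which an image embedding is compared against one embedding per class prompt. In the speech-based setting, the prompt embeddings would come from a speech encoder instead of a text encoder. The embedding vectors, class names, function names, and the temperature value below are all made up for this example and do not come from a trained model.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equally long vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_classify(image_emb, prompt_embs, temperature=0.07):
    """Return the best-matching class and a softmax over scaled similarities."""
    logits = {name: cosine(image_emb, emb) / temperature
              for name, emb in prompt_embs.items()}
    max_logit = max(logits.values())  # subtract the maximum for numerical stability
    exps = {name: math.exp(value - max_logit) for name, value in logits.items()}
    total = sum(exps.values())
    probs = {name: value / total for name, value in exps.items()}
    return max(probs, key=probs.get), probs


# Toy 2D embeddings standing in for encoder outputs.
prompt_embs = {"pneumonia": [0.9, 0.2], "normal": [0.1, 1.0]}
label, probs = zero_shot_classify([1.0, 0.1], prompt_embs)
```

No classifier is trained for the target classes here; the prediction depends only on how well the image and prompt embeddings are aligned, which is what the comparison between speech-based and text-based models would measure.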
Requirements:
- Experience with PyTorch
- Hands-on experience with training deep learning models
- Experience with Natural Language Processing (optional)
- Experience with using SLURM for job management in a GPU cluster (optional)
- Deep Learning lecture
- Pattern Recognition/Analysis lecture (optional)
Application: (Applications that do not follow the application requirements will not be considered)
- CV
- Transcript of Records
- Short motivation letter (not longer than one page)
- Email Subject: “Application Speech-CLIP” + your full name
Please send an email with your application documents to lukas.buess@fau.de
Starting Date: 01.01.2025 or later
Related Works:
[1] Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., … & Sutskever, I. (2021, July). Learning transferable visual models from natural language supervision. In International conference on machine learning (pp. 8748-8763). PMLR.
[2] Hamamci, I. E., Er, S., Almas, F., Simsek, A. G., Esirgun, S. N., Dogan, I., … & Menze, B. (2024). Developing Generalist Foundation Models from a Multimodal Dataset for 3D Computed Tomography.
[3] Ma, J., He, Y., Li, F., Han, L., You, C., & Wang, B. (2024). Segment anything in medical images. Nature Communications, 15(1), 654.