Index

Animal-Independent Signal Enhancement Using Deep Learning

Examining and segmenting bioacoustic signals is an essential part of biology. For example, by analysing orca calls it is possible to draw several conclusions about the animals’ communication and social behavior [1]. However, to avoid having to manually go through hours of audio material to detect those calls, the so-called ORCA-SPOT toolkit was developed, which uses deep learning to separate relevant signals from pure ambient sounds [2]. The extracted signals may nevertheless still contain background noise, which makes their examination rather difficult. To remove this background noise, ORCA-CLEAN was developed. Again using a deep learning approach, namely an adaptation of the Noise2Noise concept combined with machine-generated binary masks as an additional attention mechanism, it denoises the orca calls as well as possible without requiring clean data as a foundation [3].
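As a minimal sketch of the Noise2Noise idea in PyTorch (assuming any image-to-image network as `model`, e.g. a U-Net; how the binary mask enters the network, here as a second input channel, is my assumption, not necessarily ORCA-CLEAN's exact mechanism):

    import torch
    import torch.nn.functional as F

    def noise2noise_step(model, optimizer, noisy_input, noisy_target, call_mask):
        """One Noise2Noise-style training step (sketch). `noisy_input` and
        `noisy_target` are two differently corrupted spectrograms of the
        same call, shape (N, 1, F, T), so no clean targets are required.
        Stacking the machine-generated binary mask as an extra input
        channel is an assumption for illustration."""
        optimizer.zero_grad()
        x = torch.cat([noisy_input, call_mask], dim=1)  # (N, 2, F, T)
        denoised = model(x)
        loss = F.mse_loss(denoised, noisy_target)
        loss.backward()
        optimizer.step()
        return loss.item()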
As mentioned, however, this toolkit is optimized for denoising orca calls. Marine biologists are of course not the only ones who require clean audio signals for their research. Ornithologists alone deal with a great variety of noise: someone studying urban bird species wants city sounds filtered from their audio samples, whereas someone working with tropical birds wants their recordings cleaned of forest noise. One could argue that almost every biologist who analyses recordings of animal calls would have use for a denoising toolkit.
Audio denoising is also of great relevance when interpreting and processing human speech. It can be used to improve the sound quality of a phone call or a video conference, to preprocess voice commands for a virtual assistant on a smartphone, to improve speech recognition software, and much more. Even in medicine it can help, for example when analysing pulmonary auscultation signals, auscultation being the key method for detecting and evaluating respiratory dysfunctions [4].
It therefore makes sense to generalize ORCA-CLEAN and make it trainable for other animal sounds, perhaps even human speech or body sounds. One would then have a generalized version of ORCA-CLEAN that can be trained according to the desired purpose. The goal of this thesis is to describe and explain the respective changes in the code, and to evaluate how differently trained models perform on audio recordings of different animals. The transfer from a model specialized on orcas to one specialized on another animal species will be demonstrated using recordings of hyraxes. The available data contains tapes of 34 hyrax individuals; for each individual there are multiple tapes, and for each tape there is a corresponding table with information such as the exact location, length, peak frequency, and call type of each call on the tape.
The hyrax is a small hoofed mammal of the family Procaviidae [5], [6]. Hyraxes usually weigh 4 to 5 kg, are about 30 to 50 cm long, and are mostly herbivorous [5]. Their calls, especially the advertisement calls, are helpful for distinguishing different hyrax species and for analysing the animals’ behaviour [6].
Here is a rough outline of how I would realize this thesis. I would begin by modifying the ORCA-CLEAN code. Since orca calls differ considerably from hyrax calls in frequency range as well as in length, the preprocessing of the audio tapes would have to be modified, e.g. by adapting the sampling rate and the spectrogram resolution, as sketched below.
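A minimal sketch of such an adapted preprocessing using librosa; the sampling rate, FFT size, and hop length are hypothetical placeholders that would have to be tuned to the hyrax frequency range:

    import librosa
    import numpy as np

    # Hypothetical parameters: hyrax calls occupy a different frequency
    # range and duration than orca calls, so these values would be tuned.
    TARGET_SR = 22050   # resampling rate in Hz (placeholder)
    N_FFT = 1024        # FFT window size (placeholder)
    HOP_LENGTH = 256    # hop between STFT frames (placeholder)

    def tape_to_spectrogram(path):
        """Load a tape, resample it, and compute a log-magnitude
        spectrogram as network input."""
        audio, _ = librosa.load(path, sr=TARGET_SR, mono=True)
        stft = librosa.stft(audio, n_fft=N_FFT, hop_length=HOP_LENGTH)
        return np.log1p(np.abs(stft))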
I would also like to add more input/output spectrogram variants to the training process: one could use pairs of noisy and denoised human speech samples, for example, or a pure-noise spectrogram versus a completely empty one. The probability with which each of these variants is chosen could additionally be made configurable.
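A simple way to make this choice configurable is weighted sampling; the variant names and probabilities below are purely illustrative:

    import random

    # Illustrative variant names and probabilities; the actual set of
    # input/output spectrogram pairs would follow the ideas above.
    VARIANT_PROBS = {
        "noisy_call_vs_noisy_call": 0.5,      # Noise2Noise pair
        "noisy_speech_vs_clean_speech": 0.3,
        "pure_noise_vs_empty": 0.2,
    }

    def sample_variant():
        """Pick the next training-pair variant according to the
        configured probabilities."""
        names = list(VARIANT_PROBS)
        weights = list(VARIANT_PROBS.values())
        return random.choices(names, weights=weights, k=1)[0]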
After that, I would train different models on the hyrax audio tapes, including the original ORCA-CLEAN as well as the newly created adaptations, and evaluate their performance. Since the provided hyrax tapes are not all equally noisy, they can be sorted by their signal-to-noise ratio (SNR). One can then compare these values before and after denoising, e.g. by correlating them, and check whether the files were denoised correctly or whether relevant parts were removed; a rough sketch of such an SNR estimate follows below.
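A possible SNR estimate, under my assumption that the call region is known (e.g. from the annotation tables), together with a toy before/after comparison:

    import numpy as np

    def estimate_snr_db(spec, call_mask):
        """Rough per-file SNR estimate in dB: mean energy inside the
        call mask versus mean energy outside it."""
        signal_power = np.mean(spec[call_mask] ** 2)
        noise_power = np.mean(spec[~call_mask] ** 2)
        return 10.0 * np.log10(signal_power / noise_power)

    # Toy example with a synthetic spectrogram before and after denoising;
    # over a whole corpus, np.corrcoef(snr_before, snr_after) would give
    # the correlation mentioned above.
    rng = np.random.default_rng(0)
    mask = np.zeros((128, 256), dtype=bool)
    mask[40:60, 100:180] = True                      # hypothetical call region
    noisy = rng.normal(0.0, 1.0, mask.shape) + 5.0 * mask
    denoised = rng.normal(0.0, 0.3, mask.shape) + 5.0 * mask
    print(estimate_snr_db(noisy, mask), estimate_snr_db(denoised, mask))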
With the help of these results, further alterations can be made, for example by changing the probabilities of the training variants or by adapting the hyperparameters of the deep network, until, hopefully, the result is a suitable network that does not require huge amounts of data.
I hope I was able to give some insight into what I imagine the subject to be, and how I would roughly execute it.
Sources
[1] https://lme.tf.fau.de/person/bergler/#collapse_0
[2] C. Bergler, H. Schröter, R. X. Cheng, V. Barth, M. Weber, E. Nöth, H. Hofer, and A. Maier, “ORCA-SPOT: An Automatic Killer Whale Sound Detection Toolkit Using Deep Learning”, Scientific Reports, vol. 9, Dec. 2019.
[3] C. Bergler, M. Schmitt, A. Maier, S. Smeele, V. Barth, and E. Nöth, “ORCA-CLEAN: A Deep Denoising Toolkit for Killer Whale Communication”, in Interspeech 2020, pp. 1136-1140, International Speech Communication Association.
[4] F. Jin and F. Sattar, “Enhancement of Recorded Respiratory Sound Using Signal Processing Techniques”, in A. Cartelli and M. Palma (Eds.), Encyclopedia of Information Communication Technology, pp. 291-300, 2009.
[5] https://www.britannica.com/animal/hyrax
[6] https://www.wildsolutions.nl/vocal-profiles/hyrax-vocalizations/

reAction: Automatic Speech Recognition in German Automotive Domain

Deep Learning for Cancer Patient Survival Prediction Using 2D Portrait Photos Based on StyleGAN Embedding

Risk Classification of Brain Metastases via Deep Learning Radiomics

Simulation of Spike Artifact Obstructed MR Images for Machine Learning Methods

Automated Scoring of Rey-Osterrieth Complex Figure Test Using Deep Learning

Novel View Synthesis for Augmentation of Fine-Grained Image Datasets

Current deep-learning-based classification methods require large amounts of data for training, but in certain scenarios, such as surveillance imaging, only a limited amount of data is available. The aim of this research is to generate new training images of vehicles with the same characteristics as the training data but from novel viewpoints, and to investigate their suitability for fine-grained classification of vehicles.

Generative models such as generative adversarial networks (GANs) [1] allow for the customization of images. However, adjusting the perspective through methods such as conditional GANs for unsupervised image-to-image translation has proven particularly difficult [1]. Methods such as StyleGAN [2] or neural radiance fields (NeRFs) [3] are relevant approaches for generating images with different styles and perspectives.
StyleGAN is an extension of the GAN architecture that proposes changes to the generator model, such as the introduction of a mapping network. The mapping network generates intermediate latent codes, which are transformed into styles that are integrated at each layer of the generator network. It also includes a progressive growing approach for training generator models capable of synthesizing very large, high-quality images.
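A rough sketch of these two ingredients in PyTorch; layer counts and dimensions are illustrative, not the paper's exact configuration:

    import torch
    import torch.nn as nn

    class MappingNetwork(nn.Module):
        """Maps the input latent z to an intermediate latent w through
        a stack of fully connected layers, as in StyleGAN [2]."""
        def __init__(self, dim=512, n_layers=8):
            super().__init__()
            layers = []
            for _ in range(n_layers):
                layers += [nn.Linear(dim, dim), nn.LeakyReLU(0.2)]
            self.net = nn.Sequential(*layers)

        def forward(self, z):
            return self.net(z)

    class StyleInjection(nn.Module):
        """AdaIN-style injection: w is affinely transformed into a
        per-channel scale and bias that modulate normalized feature
        maps at one point in the generator."""
        def __init__(self, w_dim, channels):
            super().__init__()
            self.norm = nn.InstanceNorm2d(channels)
            self.affine = nn.Linear(w_dim, 2 * channels)

        def forward(self, x, w):
            scale, bias = self.affine(w).chunk(2, dim=1)
            return self.norm(x) * (1 + scale[..., None, None]) + bias[..., None, None]

    # Example: inject one style vector into a 64-channel feature map.
    w = MappingNetwork()(torch.randn(4, 512))
    y = StyleInjection(512, 64)(torch.randn(4, 64, 32, 32), w)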
NeRF can generate novel views of complex 3D scenes from a sparse set of 2D images. It is trained to map directly from spatial location and viewing direction (a 5D input) to opacity and color, using volume rendering [4] to render new views.
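The volume-rendering step can be written in a few lines; this is the standard quadrature from [3], [4], sketched for a single ray:

    import torch

    def render_ray(rgb, sigma, deltas):
        """NeRF volume-rendering quadrature for one ray (sketch):
        `rgb` (S, 3) and `sigma` (S,) come from the network, `deltas`
        (S,) are the distances between adjacent samples."""
        alpha = 1.0 - torch.exp(-sigma * deltas)               # per-sample opacity
        trans = torch.cumprod(
            torch.cat([torch.ones(1), 1.0 - alpha + 1e-10])[:-1], dim=0
        )                                                      # transmittance T_i
        weights = alpha * trans                                # contribution of sample i
        return (weights[:, None] * rgb).sum(dim=0)             # composited ray color

    # Example: 64 random samples along one ray.
    color = render_ray(torch.rand(64, 3), torch.rand(64), torch.full((64,), 0.01))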

The thesis consists of the following milestones:

  • Literature review on state-of-the-art approaches for GAN- and neural-radiance-field-based image synthesis
  • Adoption of existing GAN- and neural-radiance-field-based image synthesis methods to generate car images with different styles and camera poses [5]
  • Experimental evaluation and comparison of the different image synthesis methods
  • Investigation of the suitability of the generated images for fine-grained vehicle classification using different classification methods [6], [7]

The implementation will be done in Python.

References
[1] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio, “Generative Adversarial Networks”, in NIPS, 2014.
[2] Tero Karras, Samuli Laine, and Timo Aila, “A Style-Based Generator Architecture for Generative Adversarial Networks”, in Proceedings of the IEEE/CVF Conference on CVPR, 2019.
[3] Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng, “NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis”, in ECCV, 2020.
[4] Robert A. Drebin, Loren Carpenter, and Pat Hanrahan, “Volume Rendering”, in Proceedings of SIGGRAPH, 1988.
[5] Jiatao Gu, Lingjie Liu, Peng Wang, and Christian Theobalt, “StyleNeRF: A Style-based 3D-Aware Generator for High-resolution Image Synthesis”, in ICLR, 2022.
[6] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo, “Swin Transformer: Hierarchical Vision Transformer using Shifted Windows”, in IEEE/CVF ICCV, 2021.
[7] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep Residual Learning for Image Recognition”, in Proceedings of the IEEE Conference on CVPR, 2016.

Modelling of the Breast During the Mammography Examination

Metal-conscious Transformer Enhanced CBCT Projection Inpainting

Computed tomography (CT) is a tomographic imaging technology that has developed rapidly. Due to the beam-hardening effect, metallic implants cause artifacts that degrade the quality of CT images. Metal artifacts have been a focus and a difficulty of CT imaging research because of their direct impact on clinical diagnosis and the diversity of their manifestations and causes [1]. In order to reconstruct metal-free CT images, inpainting of the affected projection data is an essential step.
The traditional method of inpainting replaces the metal-affected region of the projection data by interpolation [2][3]; a sketch of this baseline follows after this paragraph. Recently, deep convolutional neural networks (CNNs) have shown strong potential in many computer vision tasks, including image inpainting, and several CNN-based encoder-decoder approaches have been proposed for image restoration. Shift-Net, which builds on the U-Net architecture, is one of these approaches and achieves good restoration accuracy in structure and texture [4]. Zeng et al. [5] built a pyramidal-context architecture called PEN-Net for high-quality image inpainting. Liao et al. [6] proposed a generative mask pyramid network for CT/CBCT metal artifact reduction. Although CNNs have many advantages, their receptive field is usually small, which is not conducive to capturing global features. In contrast, the Vision Transformer (ViT) [7] uses attention to model long-range dependencies among image patches, and the shifted window Transformer (Swin Transformer) [8] was proposed to handle the high resolution of images in vision tasks while taking into account the translational invariance of CNNs, the receptive field, and hierarchical relationships.
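A minimal sketch of this interpolation baseline on a single projection, assuming a boolean mask of metal-affected detector pixels is given:

    import numpy as np

    def interpolate_metal_trace(projection, metal_mask):
        """Replace metal-affected pixels in one projection by linear
        interpolation along each detector row, in the spirit of [2];
        `metal_mask` is a boolean array marking affected pixels."""
        out = projection.copy()
        cols = np.arange(projection.shape[1])
        for r in range(projection.shape[0]):
            bad = metal_mask[r]
            if bad.any() and (~bad).any():
                out[r, bad] = np.interp(cols[bad], cols[~bad], projection[r, ~bad])
        return out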
To overcome the shortage of medical image data and the domain-shift problem in deep learning, this research is based on simulated X-ray images and uses a ViT as the encoder and a CNN as the decoder for image inpainting. In order to further improve the inpainting performance, variants of the backbone network are considered, such as using the Swin Transformer instead of ViT and adding an adversarial loss.
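To make the planned architecture concrete, here is a minimal sketch of such a ViT-encoder/CNN-decoder in PyTorch; it illustrates the idea only, and all sizes and layer counts are hypothetical placeholders, not the exact network of this work:

    import torch
    import torch.nn as nn

    class ViTEncoderCNNDecoder(nn.Module):
        """Hybrid inpainting sketch: patch embedding and a Transformer
        encoder capture global context, a small convolutional decoder
        restores the image."""
        def __init__(self, img=256, patch=16, dim=256, depth=4, heads=8):
            super().__init__()
            self.grid = img // patch
            self.embed = nn.Conv2d(1, dim, kernel_size=patch, stride=patch)
            self.pos = nn.Parameter(torch.zeros(1, self.grid ** 2, dim))
            layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                               batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
            self.decoder = nn.Sequential(
                nn.ConvTranspose2d(dim, 64, kernel_size=4, stride=4),
                nn.ReLU(),
                nn.ConvTranspose2d(64, 1, kernel_size=4, stride=4),
            )

        def forward(self, x):                                   # (N, 1, 256, 256)
            n = x.shape[0]
            tokens = self.embed(x).flatten(2).transpose(1, 2)   # (N, 256, dim)
            tokens = self.encoder(tokens + self.pos)
            fmap = tokens.transpose(1, 2).reshape(n, -1, self.grid, self.grid)
            return self.decoder(fmap)                           # (N, 1, 256, 256)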
The thesis will include the following points:
• Literature review on inpainting and metal artifact reduction.
• Implementation of the traditional method and a CNN-based model.
• Construction of a ViT-based model, parameter optimization, incorporation of an adversarial loss, and evaluation of the results.
• Thesis writing.

References
[1] C. Netto, N. Mansur, T. Tazegul, M. Lalevee, H. Lee, A. Behrens, F. Lintz, A. Godoy-Santos, K. Dibbern, and D. Anderson, “Implant Related Artifact Around Metallic and Bio-Integrative Screws: A CT Scan 3D Hounsfield Unit Assessment”, Foot & Ankle Orthopaedics, vol. 7, 2473011421S00174, 2022.
[2] W. A. Kalender, R. Hebel, and J. Ebersberger, “Reduction of CT artifacts caused by metallic implants”, Radiology, vol. 164, no. 2, pp. 576-577, Aug. 1987.
[3] E. Meyer, R. Raupach, M. Lell, B. Schmidt, and M. Kachelriess, “Normalized metal artifact reduction (NMAR) in computed tomography”, Medical Physics, vol. 37, no. 10, pp. 5482-5493, Oct. 2010.
[4] Z. Yan, X. Li, M. Li, W. Zuo, and S. Shan, “Shift-Net: Image inpainting via deep feature rearrangement”, 2018.
[5] Y. Zeng, J. Fu, H. Chao, and B. Guo, “Learning pyramid-context encoder network for high-quality image inpainting”, 2019.
[6] H. Liao, W.-A. Lin, Z. Huo, L. Vogelsang, W. J. Sehnert, S. K. Zhou, and J. Luo, “Generative mask pyramid network for CT/CBCT metal artifact reduction with joint projection-sinogram correction”, in International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 77-85, Springer, 2019.
[7] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., “An image is worth 16×16 words: Transformers for image recognition at scale”, arXiv preprint arXiv:2010.11929, 2020.
[8] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin Transformer: Hierarchical vision transformer using shifted windows”, 2021.

Detection and Classification of Photovoltaic Modules in Electroluminescence Videos