Index
Evaluation of a Bayesian U-Net for Glacier Segmentation
Guided Attention Mechanism for Weakly-Supervised Breast Calcification Analysis
Self-supervised Learning for Pathology Classification
Motivation
Self-supervised learning is a promising approach in the field of speech processing. The capacity to learn
representations from unlabelled data with minimal feature-engineering efforts results in increased
independence from labelled data. This is particularly relevant in the pathological speech domain, where the
amount of labelled data is limited. However, as most research focuses on healthy speech, the effect of self-supervised learning on pathological speech data remains under-researched. This motivates the current research, as pathological speech will potentially benefit from the self-supervised learning approach.
Proposed Method
Self-supervised machine learning makes the most of unlabelled data when training a model. Wav2vec 2.0, an algorithm that trains speech representations almost exclusively on raw, unlabelled audio [1][2], will be used. These representations can serve as input features for numerous downstream tasks, as alternatives to traditional Mel-Frequency Cepstral Coefficients or log-mel filterbanks. To evaluate the quality of the trained representations, their performance will be examined on a binary classification task in which the model predicts whether or not the input speech is pathological.
A novel database containing audio files in German collected using the PEAKS software [3] will be used.
Here, patients with speech disorders, such as dementia, cleft lip, and Alzheimer’s Disease, were recorded
performing two different speech tasks: picture reading in “Psycho-Linguistische Analyse Kindlicher Sprech-Störungen” (PLAKSS) and “The North Wind and the Sun” (Northwind) [3]. As the database is still being
revised, some pre-processing of the data must be performed, for example, removing the voice of a (healthy)
therapist from the otherwise pathological recordings. After preprocessing, the data will be input to the
wav2vec 2.0 framework for self-supervised learning, which will be used as a pre-trained model in the
pathology classification task.
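To make the downstream setup concrete, the following is a minimal illustrative sketch, not the thesis pipeline: it assumes the frame-level wav2vec 2.0 embeddings of an utterance are mean-pooled into a single vector and fed to a small logistic-regression classification head. The embeddings themselves are replaced here by synthetic Gaussian stand-ins (the helper `fake_utterance` and the 8-dimensional size are invented for the example; real wav2vec 2.0 embeddings have 768 or 1024 dimensions).

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_pool(frames):
    """Pool (num_frames, dim) frame embeddings into one utterance vector."""
    return frames.mean(axis=0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_head(X, y, lr=0.5, epochs=300):
    """Fit a logistic-regression head on pooled embeddings by gradient descent."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        p = sigmoid(X @ w + b)
        w -= lr * X.T @ (p - y) / len(y)   # gradient of binary cross-entropy w.r.t. w
        b -= lr * np.mean(p - y)           # gradient w.r.t. bias
    return w, b

# Synthetic stand-in for an utterance: 50 frames x 8 dims; "pathological"
# utterances get a shifted mean purely so the toy problem is learnable.
def fake_utterance(shift):
    return mean_pool(rng.normal(shift, 1.0, size=(50, 8)))

X = np.array([fake_utterance(0.0) for _ in range(100)] +   # healthy
             [fake_utterance(0.8) for _ in range(100)])    # pathological
y = np.array([0] * 100 + [1] * 100)

w, b = train_logistic_head(X, y)
accuracy = np.mean((sigmoid(X @ w + b) > 0.5) == y)
```

In the actual experiments, the pooled vectors would come from a wav2vec 2.0 model pre-trained on the PEAKS recordings, and the head could equally be a deeper classifier; only the pool-then-classify structure carries over.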
Hypothesis
Given the benefits of acquiring learned representations without labelled data, the hypothesis is that the self-supervised model will outperform the approach without self-supervision in the classification experiment. The results of the pathological speech detection downstream task are expected to show the positive effects of pre-trained representations obtained by self-supervised learning.
Furthermore, the model is expected to enable automatic self-assessment for patients using minimally invasive methods and to assist therapists by providing objective measures for their diagnoses.
Supervision
Professor Dr. Andreas Maier, Professor Dr. Seung Hee Yang, M. Sc. Tobias Weise
References
[1] Schneider, S., Baevski, A., Collobert, R., Auli, M. (2019). wav2vec: Unsupervised Pre-Training for Speech Recognition. Proc. Interspeech 2019, 3465–3469.
[2] Baevski, A., Zhou, Y., Mohamed, A., Auli, M. (2020). wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations. Advances in Neural Information Processing Systems, vol. 33, 12449–12460. Curran Associates, Inc.
[3] Maier, A., Haderlein, T., Eysholdt, U., Rosanowski, F., Batliner, A., Schuster, M., Nöth, E. (2009). PEAKS – A System for the Automatic Evaluation of Voice and Speech Disorders. Speech Communication.
Automatic identification of unremarkable Medical Images
Human-interpretable Writer Retrieval and Verification
PowerPoint Presentation Describer: Machine Learning Methods to Automatically Generate Business Captions from Graphics
Detection of localized necking in Hydraulic Bulge Tests using Deep Learning Methods
Reinforcement Learning in Optimum Order Execution
Empathetic Deep Learning to the Rescue: Speech Emotion Recognition from Adults to Children
Emotional states strongly influence humans’ choices, activities, and desires. They can be assessed from the face, from self-report questionnaires and, as this thesis focuses on, from speech. While some research on speech emotion recognition exists, it has made limited use of deep learning approaches, owing to the field’s recency and only recent improvements in computation and optimization. In addition, the difficulty of collecting improvised data, rather than recordings of professional adult actors, persists in the state-of-the-art literature. The goal of this thesis is therefore to explore speech emotion recognition in children by testing the predominant neural-network approaches with temporal prosody features as well as the rapidly expanding family of Transformer methods. We investigate the potential of transfer learning from adults’ to children’s data as a mechanism for dealing with scarce data. From the outcomes, we observe improved transfer when gender and cultural aspects are included in the classification of emotions. Emotionally intelligent systems built on the experiments described in this thesis can benefit remote monitoring and telemedicine for psychologists and pediatricians, teaching emotional intelligence to autistic children, and improving children’s health diagnostics and screening procedures.
Classical Acoustic Markers for Depression in Parkinson’s Disease
Parkinson’s disease (PD) patients are commonly recognized by their tremors, although PD presents a wide range of other symptoms. It is a progressive neurological condition in which patients lack dopamine in the substantia nigra, a region that plays a role in motor control, mood, and cognitive function. An often underestimated class of PD symptoms are the mental and behavioral issues, which can manifest as depression, fatigue, or dementia. Clinical depression is a psychiatric mood disorder, caused by an individual’s difficulty in coping with stressful life events, and presents as persistent feelings of sadness, negativity, and difficulty managing everyday responsibilities. It can be triggered by the lack of dopamine in PD, by the upsetting and stressful situation of the Parkinson’s diagnosis, as well as by the loneliness and isolation that the Parkinson’s symptoms can cause.
The goal of this work is to find the acoustic features best suited to discriminating depression in Parkinson’s patients. These features will be based on classical, interpretable acoustic descriptors.