Self-supervised learning is a promising approach in the field of speech processing. The capacity to learn
representations from unlabelled data with minimal feature engineering reduces the dependence on labelled data. This is particularly relevant in the pathological speech domain, where the
amount of labelled data is limited. However, as most research focuses on healthy speech, the effect of self-supervised learning on pathological speech data remains under-researched. This motivates the current research, as pathological speech processing will potentially benefit from the self-supervised learning approach.
Self-supervised machine learning helps make the most of unlabelled data when training a model. Wav2vec 2.0 will be used, an algorithm that trains speech representations almost exclusively from raw, unlabelled audio [1, 2]. These representations can serve as alternatives to traditional input features such as Mel-frequency cepstral coefficients (MFCCs) or log-mel filterbanks in numerous downstream tasks. To evaluate the performance of the trained representations, it will be examined how well they perform on a binary classification task in which the model predicts whether or not the input speech is pathological.
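As an illustration of this downstream evaluation, the following is a minimal sketch, assuming utterance-level embeddings have already been extracted (e.g. by averaging wav2vec 2.0 frame outputs over time); the data below is synthetic and the logistic-regression classifier is only one possible choice, not the method fixed by the proposal:

```python
import math
import random

def train_logistic(X, y, lr=0.1, epochs=200):
    """Plain logistic regression trained with per-sample gradient descent."""
    dim = len(X[0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for x, t in zip(X, y):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - t  # gradient of the cross-entropy loss w.r.t. z
            for i in range(dim):
                w[i] -= lr * g * x[i]
            b -= lr * g
    return w, b

def predict(w, b, x):
    """1 = pathological, 0 = healthy."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

# Synthetic stand-ins for pooled wav2vec 2.0 utterance embeddings:
# the two classes are drawn from Gaussians with different means.
random.seed(0)
healthy = [[random.gauss(0.0, 1.0) for _ in range(8)] for _ in range(40)]
pathological = [[random.gauss(1.5, 1.0) for _ in range(8)] for _ in range(40)]
X = healthy + pathological
y = [0] * 40 + [1] * 40

w, b = train_logistic(X, y)
accuracy = sum(predict(w, b, x) == t for x, t in zip(X, y)) / len(X)
```

In the actual experiments, the synthetic vectors would be replaced by embeddings extracted from the pre-trained model, and evaluation would of course use a held-out test split rather than the training data.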
A novel database of German audio recordings collected using the PEAKS software [3] will be used. Here, patients with disorders affecting speech, such as dementia, cleft lip, and Alzheimer’s disease, were recorded performing two different speech tasks: picture reading in “Psycho-Linguistische Analyse Kindlicher Sprechstörungen” (PLAKSS) and “The North Wind and the Sun” (Northwind). As the database is still being
revised, some pre-processing of the data must be performed, for example, removing the voice of a (healthy)
therapist from the otherwise pathological recordings. After preprocessing, the data will be input to the
wav2vec 2.0 framework for self-supervised learning; the resulting pre-trained model will then serve as the basis for the pathology classification task.
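The therapist-removal step mentioned above could, for instance, be sketched as follows, under the assumption that the therapist’s turns have already been located as (start, end) intervals in seconds (e.g. by a speaker-diarization tool); the function name and interface are illustrative, not part of the proposal:

```python
def remove_segments(samples, segments, sample_rate):
    """Return the audio with the given (start_s, end_s) intervals cut out.

    samples: a sequence of audio samples,
    segments: therapist intervals in seconds,
    sample_rate: samples per second.
    """
    drop = [False] * len(samples)
    for start_s, end_s in segments:
        lo = max(0, round(start_s * sample_rate))
        hi = min(len(samples), round(end_s * sample_rate))
        for i in range(lo, hi):
            drop[i] = True  # mark therapist samples for removal
    return [s for s, d in zip(samples, drop) if not d]

# Toy 1-second signal at 10 Hz; cut a therapist turn from 0.3 s to 0.6 s.
audio = list(range(10))
patient_only = remove_segments(audio, [(0.3, 0.6)], sample_rate=10)
# samples at indices 3, 4, 5 are removed
```

Real recordings would use their native sampling rate (e.g. 16 kHz for wav2vec 2.0), and the remaining patient-only segments could also be kept as separate utterances rather than concatenated, depending on what the pre-training pipeline expects.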
Given the benefits of acquiring learned representations without labelled data, the hypothesis is that the self-supervised model will outperform the approach without self-supervision in the classification experiment. The
results of the pathological speech detection downstream task are expected to show the positive effects of
pre-trained representations obtained by self-supervised learning.
Furthermore, the model is expected to enable automatic self-assessment for patients using minimally invasive methods and to assist therapists by providing objective measures for their diagnoses.
[1] S. Schneider, A. Baevski, R. Collobert, and M. Auli, “wav2vec: Unsupervised Pre-Training for Speech Recognition,” in Proc. Interspeech 2019, 2019, pp. 3465–3469.
[2] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations,” in Advances in Neural Information Processing Systems, 2020, vol. 33, pp. 12449–12460, Curran Associates, Inc.
[3] A. Maier, T. Haderlein, U. Eysholdt, F. Rosanowski, A. Batliner, M. Schuster, and E. Nöth, “PEAKS – A system for the automatic evaluation of voice and speech disorders,” Speech Communication, 2009.
Self-supervised learning for pathology classification
Type: BA thesis
Date: May 4, 2022 - October 4, 2022
Supervisors: Prof. Dr. Andreas Maier, Prof. Dr. Seung Hee Yang, M. Sc. Tobias Weise