Philipp Klumpp


Phonetic Transfer Learning from Healthy References for the Analysis of Pathological Speech

Automatic speech recognition (ASR) experienced a major leap forward through the many advances in deep machine learning over the past decade. With each new iteration of neural ASR architectures, the increase in performance was accompanied by a growing number of model parameters. Speech recognizers could be optimized on many thousands of hours of audio data covering a vast range of speakers, acoustic conditions and recording setups, resulting in very robust decision boundaries. However, if the training dataset was small and did not cover much variability, optimization could quickly lead to an overfitted model.

Compared to large general-purpose ASR datasets, pathological speech corpora are much smaller. Additionally, they were often recorded under very controlled conditions to serve as a valid reference for scientific experiments. Nevertheless, it is common practice to optimize large neural networks on small amounts of pathological speech and report results on a held-out test set of the same corpus: a valid strategy to evaluate and support a particular methodology, but the reported results often collapse in the presence of entirely new data. This thesis proposes the opposite approach: all models presented throughout this work were optimized only on healthy speech from standard ASR datasets, with the base task of recognizing speech sounds from the International Phonetic Alphabet (IPA). These highly robust recognizers could then be used to analyze speech signals from different pathologies.

For Parkinson's Disease (PD), it was possible to estimate the available phonetic inventory of speakers as the condition progressed through various stages of severity. It could also be shown that any decrease in intelligibility perceived by human listeners also manifested in the recognition confidences of the proposed models, indicating a clear shift away from the well-explored healthy reference. PD greatly impacts the phonation of vowels: an increasing monotonicity manifested itself in a clear shrinkage of the estimated vowel triangles of severely affected patients. The estimated vowel space could also be used as a surrogate to track the severity of speech symptoms over time, with correlations of up to 0.81 with the expert annotation.
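To make the vowel space measure concrete, the following minimal sketch computes the area of a vowel triangle from the first two formants of the corner vowels /a/, /i/ and /u/ with the shoelace formula. The formant values are hypothetical placeholders, not data from the thesis, and the thesis itself derives the triangles from its phone-level recognizers rather than from hand-picked formants.

```python
# Minimal sketch: vowel triangle area from (F1, F2) of the corner vowels.
# All formant values below are hypothetical placeholders.

def triangle_area(p1, p2, p3):
    """Shoelace formula for the area spanned by three (F1, F2) points, in Hz^2."""
    (x1, y1), (x2, y2), (x3, y3) = p1, p2, p3
    return abs(x1 * (y2 - y3) + x2 * (y3 - y1) + x3 * (y1 - y2)) / 2.0

# Hypothetical corner vowel formants (F1, F2) in Hz for a healthy reference ...
healthy = {"a": (800, 1300), "i": (300, 2300), "u": (320, 800)}
# ... and for a speaker with more centralized, monotonous vowel articulation.
centralized = {"a": (650, 1400), "i": (420, 1900), "u": (430, 1100)}

area_healthy = triangle_area(healthy["a"], healthy["i"], healthy["u"])
area_centralized = triangle_area(centralized["a"], centralized["i"], centralized["u"])

print(f"healthy: {area_healthy:.0f} Hz^2, centralized: {area_centralized:.0f} Hz^2")
print(f"relative vowel space: {area_centralized / area_healthy:.2f}")
```

A ratio well below 1 relative to the healthy reference would reflect the kind of shrinkage described above.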
For the analysis of Covid-19 speech, the proposed models served primarily as feature extractors. Each extracted feature vector was associated with a speech sound, which made it possible to identify differences between speaker cohorts and relate them directly to particular structures in the vocal tract. Results showed a clear impact of the respiratory condition on the speech signal, but, contrary to what many other studies reported, the differences between Covid-19 and other conditions, such as a common cold, were small in terms of effect size and almost never significant.

The proposed methods were not limited to the analysis of speech disorders; they could also be used in the therapy of aphasia, a language disorder. Various keyword recognition models, grounded in the existing phone recognizers, were proposed; they modeled an open vocabulary of target words as plain phoneme sequences (a minimal sketch of this idea is given below). Attention-based models achieved F-scores above 0.9. Even for those patients who suffered from concomitant speech deficits, the recognizers reliably differentiated between correct and invalid keyword productions.

While overcoming the lack of available pathological speech data, this thesis also highlighted a path towards more explainable neural networks and classifiers in general. Instead of directly mapping an input signal to an output prediction, the extraction of phonetic knowledge could support any further decision making or evaluation, for both medical experts and algorithms.
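As a minimal sketch of the phoneme-sequence idea referenced above: the target transcription, the acceptance threshold and the plain edit-distance criterion are illustrative assumptions, standing in for the attention-based keyword recognizers of the thesis. A recognized phone sequence could then be verified against a canonical target as follows.

```python
# Minimal sketch: open-vocabulary keyword verification via phoneme sequences.
# Target transcription and threshold are hypothetical; a normalized edit
# distance stands in for the attention-based models described in the thesis.

def edit_distance(a, b):
    """Levenshtein distance between two phone sequences (rolling-row DP)."""
    dp = list(range(len(b) + 1))
    for i, pa in enumerate(a, start=1):
        prev, dp[0] = dp[0], i
        for j, pb in enumerate(b, start=1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                     dp[j - 1] + 1,      # insertion
                                     prev + (pa != pb))  # substitution
    return dp[-1]

def is_valid_production(recognized, target, threshold=0.35):
    """Accept a keyword production if the recognized phones are close enough
    to the canonical target phoneme sequence (normalized edit distance)."""
    distance = edit_distance(recognized, target) / max(len(target), 1)
    return distance <= threshold

# Hypothetical target word and two recognized phone sequences.
target = ["b", "a", "n", "aː", "n", "ə"]
print(is_valid_production(["b", "a", "n", "aː", "n", "ə"], target))  # True
print(is_valid_production(["t", "a", "l", "oː", "n"], target))       # False
```

Because the target is just a phoneme sequence, new keywords can be added without retraining, which is what makes the vocabulary open.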