Distillation Learning for Speech Enhancement

Type: MA thesis

Status: finished

Date: February 1, 2021 - June 30, 2021

Supervisors: Hendrik Schröter, Andreas Maier

Noise suppression has remained a field of interest for more than five decades, and a number of techniques have been employed to extract clean and/or noise-free data. Continuous audio and video signals pose particular challenges for noise reduction, and deep neural network (DNN) techniques have been designed to enhance such signals (Valin, 2018). While DNNs are effective, they are computationally expensive and demand substantial memory resources. The aim of the proposed thesis is to address these constraints when working with limited memory and computational power, without compromising much on model performance.

A neural network (NN) can easily overfit the training data, owing to its large number of parameters and the number of training epochs for which it is trained on the given data (Dakwale & Monz, 2019). One solution is to use an ensemble (combination) of models trained on the same data to achieve better generalization. The limitation of this solution becomes apparent under hardware constraints, when the network needs to run on hardware with limited memory and computational power, such as mobile phones. This resource limitation motivates distillation learning, in which the knowledge of a complex or ensembled network is transferred to a relatively simpler and computationally less expensive model.
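
The core idea can be illustrated with a minimal sketch of the distillation loss in the sense of Hinton et al. (2015). This is an assumption-laden illustration, not the method of the thesis: it presumes a PyTorch setup, classification-style logits, and illustrative values for the temperature and weighting.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets,
                      temperature=2.0, alpha=0.5):
    """Combine a hard-label loss with a soft-target (distillation) loss.

    The teacher's softened probability distribution provides the extra
    training signal for the smaller student model. Temperature and alpha
    are illustrative values, not taken from the thesis.
    """
    # Softened distributions: a temperature > 1 exposes the teacher's
    # relative confidence across classes ("dark knowledge").
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)

    # KL divergence between softened student and teacher outputs; the
    # T^2 factor keeps gradient magnitudes comparable (Hinton et al., 2015).
    soft_loss = F.kl_div(log_soft_student, soft_teacher,
                         reduction="batchmean") * temperature ** 2

    # Standard cross-entropy against the original hard labels.
    hard_loss = F.cross_entropy(student_logits, targets)

    return alpha * soft_loss + (1.0 - alpha) * hard_loss
```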

Following the framework of distillation learning, a teacher-student network will be designed, starting from an existing trained teacher network. The teacher network has been trained on audio data with hard labels, using a dense parameter matrix. The large number of parameters determines the complexity of the neural network and also its ability to identify and suppress signal noise (Hinton et al., 2015). The proposed method is to design a student network that tries to imitate the output of the teacher, i.e., its probability distribution, without needing to be trained with the same number of parameters. By transferring the knowledge of the teacher to the student network, a simpler model with a reduced set of parameters can be designed, which is better suited for hardware with lower memory and computational power.
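
As a rough sketch of such a teacher-student training step for enhancement: the model interfaces, the use of noisy spectrogram inputs with mask- or spectrum-like outputs, the MSE imitation loss, and the loss weighting are all assumptions made for illustration, not details fixed by the thesis description.

```python
import torch
import torch.nn.functional as F

def train_step(student, teacher, optimizer, noisy_batch, clean_target,
               beta=0.5):
    """One hypothetical distillation step for a speech enhancement student.

    `teacher` is a large pre-trained enhancement model (kept frozen);
    `student` is a smaller model with fewer parameters. Both are assumed
    to map noisy input features to an enhancement output; these names
    and the loss weighting `beta` are placeholders.
    """
    teacher.eval()
    with torch.no_grad():
        teacher_out = teacher(noisy_batch)   # teacher's estimate (soft target)

    student_out = student(noisy_batch)       # student's estimate

    # Imitation loss: push the student's output toward the teacher's.
    imitation_loss = F.mse_loss(student_out, teacher_out)

    # Optional supervised loss against the clean reference, if available.
    supervised_loss = F.mse_loss(student_out, clean_target)

    loss = beta * imitation_loss + (1.0 - beta) * supervised_loss

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```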