A Robust Intrusive Perceptual Audio Quality Assessment Based on a Convolutional Neural Network

Abstract
The goal of a perceptual audio quality predictor is to capture the auditory
experience of listeners and to score audio excerpts without creating a
massive workload for the listeners. Methods such as PESQ and ViSQOL
serve as computational proxies for subjective listening tests. ViSQOLAudio,
the Virtual Speech Quality Objective Listener in audio mode, is
a signal-based, full-reference, intrusive metric that models human audio
quality perception using a gammatone spectro-temporal measure of
similarity between a reference and a degraded audio signal. Here we
propose an end-to-end model based on a convolutional neural network
with a self-attention mechanism that predicts the perceived quality of audio
given a clean reference signal and is designed to be robust to adversarial
examples. The model is trained and evaluated on a corpus of up to 12 hours of
unencoded 48 kHz audio, in which each excerpt is labeled with a
Mean Opinion Score (MOS) derived by ViSQOLAudio.
Keywords: perceptual audio quality assessment, MOS, ViSQOLAudio,
full reference, deep learning, self-attention, end-to-end model
Introduction
Digital audio systems and services use codecs to encode and decode a digital
data stream or signal in order to minimize bandwidth and maximize users’
quality of experience. Different codecs introduce different quality degradations
and artefacts, which affect the perceived audio quality. To evaluate
codec performance, a Mean Opinion Score (MOS) is obtained by asking listeners
to rate the quality of an audio clip on a scale from one to five. Because such
listening tests are tedious and expensive, several computational approaches
have been designed to automate them by predicting MOS. Intrusive methods,
i.e. those with a full reference signal, calculate a perceptually weighted
distance between the clean (unencoded) reference and the degraded (coded)
signal. PEAQ, POLQA, PEMO-Q and ViSQOLAudio are four major full-reference
models. ViSQOLAudio, which is the focus and inspiration of this thesis, is an
adaptation of ViSQOL that functions as a perceptual audio quality prediction
model. ViSQOLAudio introduces a series of improvements and has achieved
outstanding performance against POLQA, PEAQ and PEMO-Q. Inspired and
motivated by ViSQOLAudio, we design an end-to-end deep learning network that
predicts MOS from gammatone spectrograms, an input that mirrors the algorithm
of ViSQOLAudio, aiming to improve prediction performance and robustness to
adversarial examples.
Figure 1: A representation of ViSQOLAudio
Background
The ViSQOLAudio pipeline consists of four phases: preprocessing, pairing,
comparison and, finally, mapping of the similarity measure to a MOS. In the
preprocessing stage, the middle channel of the reference and degraded signals is
extracted, misalignment caused by zero padding is removed, and gammatone
spectrograms are then calculated for both signals. Gammatone filters are a
popular linear approximation of the filtering performed by the human auditory
system. The audio signal is represented as a time-varying distribution of
energy over frequency, which is one way of describing the information the brain
receives from the ears via the auditory nerves. A conventional spectrogram
differs from how the ear analyzes sound: the ear's frequency sub-bands get wider
at higher frequencies, whereas a conventional spectrogram keeps a constant
bandwidth across all frequency channels.
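The widening of auditory sub-bands can be illustrated with the equivalent rectangular bandwidth (ERB) approximation of Glasberg and Moore, which gammatone filterbanks commonly use. The following is a minimal sketch only: the function name and the chosen centre frequencies are illustrative assumptions, and this is not the filterbank code used inside ViSQOLAudio.

```python
def erb_bandwidth_hz(center_hz: float) -> float:
    """Equivalent rectangular bandwidth (Glasberg & Moore approximation)."""
    return 24.7 * (4.37 * center_hz / 1000.0 + 1.0)

# Illustrative centre frequencies between 50 Hz and 24 kHz (half of 48 kHz).
centers_hz = [50.0, 500.0, 5000.0, 24000.0]
bandwidths_hz = [erb_bandwidth_hz(f) for f in centers_hz]

for f, bw in zip(centers_hz, bandwidths_hz):
    print(f"centre {f:8.0f} Hz -> auditory bandwidth {bw:8.1f} Hz")
```

The printed bandwidths grow with the centre frequency, whereas an FFT-based spectrogram would use the same bin width for every channel.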
The pairing step first segments the spectrogram of the reference signal into a
sequence of patches of 32 frequency bands by 30 frames (i.e., a 32 x 30
matrix). Patches of the same size are then iteratively extracted from the
degraded signal to calculate reference-degraded distances and to build a set of
the most similar reference-degraded patch pairs. The similarity of each pair is
then calculated in the comparison step and averaged across all frequency bands.
In the last step, the mean frequency band similarity score is mapped to a MOS
using a support vector regression model.
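The pairing step described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the ViSQOLAudio implementation: the helper names are hypothetical, the search slides over the whole degraded spectrogram, mean absolute difference stands in for the actual distance measure, and the spectrograms are random toy data.

```python
import numpy as np

PATCH_FRAMES = 30   # patch width in time frames (from the text)
NUM_BANDS = 32      # gammatone frequency bands (from the text)

def segment_patches(spec: np.ndarray) -> list[tuple[int, np.ndarray]]:
    """Split a [bands, frames] spectrogram into non-overlapping 32 x 30 patches."""
    n_frames = spec.shape[1]
    starts = range(0, n_frames - PATCH_FRAMES + 1, PATCH_FRAMES)
    return [(s, spec[:, s:s + PATCH_FRAMES]) for s in starts]

def best_match(ref_patch: np.ndarray, deg_spec: np.ndarray) -> tuple[int, float]:
    """Return (start frame, distance) of the closest degraded patch.
    Mean absolute difference is an assumed stand-in distance."""
    best_start, best_dist = 0, float("inf")
    for s in range(deg_spec.shape[1] - PATCH_FRAMES + 1):
        candidate = deg_spec[:, s:s + PATCH_FRAMES]
        dist = float(np.mean(np.abs(ref_patch - candidate)))
        if dist < best_dist:
            best_start, best_dist = s, dist
    return best_start, best_dist

# Toy data: the "degraded" spectrogram is the reference delayed by 3 frames.
rng = np.random.default_rng(0)
ref = rng.random((NUM_BANDS, 120))
deg = np.concatenate([rng.random((NUM_BANDS, 3)), ref], axis=1)

# Each entry: (reference start frame, matched degraded start frame, distance).
pairs = [(start, *best_match(patch, deg)) for start, patch in segment_patches(ref)]
for ref_start, deg_start, dist in pairs:
    print(ref_start, deg_start, round(dist, 4))
```

In this toy case every reference patch is matched to the degraded patch that starts 3 frames later, which shows how pairing absorbs small misalignments before the similarity is computed.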
Dataset
The dataset used by the Microsoft team for full-reference speech quality
evaluation is sampled at 16 kHz and comprises 2010 clean speech samples, each
up to 20 seconds long with 3 utterances, approximately 33 hours in total. The
speech data for the attentional Siamese neural networks were collected from 11
different databases from the POLQA pool, with 5000 clean reference signals
totalling up to 16 hours. A dataset of between 10 and 30 hours should therefore
be both adequate and efficient for unbiased computation in our case.
We collected mono audio files sampled at 48 kHz to build our clean reference
dataset, which consists of 4500 music excerpts and 900 speech excerpts.
Each excerpt is exactly 8 seconds long, for a total of 12 hours. The
reference audio clips are then encoded and decoded with the HE-AAC and AAC
codecs at eight bitrates: 16, 20, 24, 32, and 48 kbps with HE-AAC, and 64, 96,
and 128 kbps with plain AAC. Audio coded above 128 kbps is hardly audibly
different from the uncoded signal, while coding below 16 kbps greatly reduces
audio quality and makes little sense in common practical applications. In
total, 43,200 degraded signals are generated from the 5400 clean reference
signals; ideally, they are expected to fall into 8 different quality intervals
corresponding to the coding bitrates.
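The dataset figures above can be checked with a few lines of arithmetic. The sketch below uses only the numbers stated in the text; the variable names and the dictionary layout are illustrative assumptions.

```python
# Dataset bookkeeping using only the figures stated in the text.
NUM_MUSIC, NUM_SPEECH = 4500, 900
EXCERPT_SECONDS = 8

# Bitrate (kbps) -> codec, as described in the text.
CODEC_BY_BITRATE = {
    16: "HE-AAC", 20: "HE-AAC", 24: "HE-AAC", 32: "HE-AAC", 48: "HE-AAC",
    64: "AAC", 96: "AAC", 128: "AAC",
}

num_references = NUM_MUSIC + NUM_SPEECH
total_hours = num_references * EXCERPT_SECONDS / 3600
num_degraded = num_references * len(CODEC_BY_BITRATE)

print(num_references, total_hours, num_degraded)
```

This reproduces the 5400 reference excerpts, the 12 hours of reference audio, and the 43,200 degraded signals (5400 references times 8 bitrates).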
The reference and degraded signals are then paired, aligned and
fed into ViSQOLAudio to obtain MOS values, which serve as ground-truth labels
in place of human-annotated MOS scores. Gammatone spectrograms
of the reference and degraded signals are extracted using the MATLAB
implementation of the gammatone spectrogram by Daniel Ellis, the same
implementation that runs inside ViSQOLAudio. The gammatone
spectrogram of each audio signal is calculated with a window size of 80 ms,
a hop size of 20 ms, and 32 frequency bands from 50 Hz up to half the sample
rate. The gammatone spectrograms of the reference and degraded signals are
paired and concatenated channel-wise into the shape [channels, time frames,
frequency bands] and used as inputs to our neural network.
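To make the input layout concrete, the sketch below converts the window and hop sizes to samples and builds a placeholder input tensor for one 8-second excerpt. It assumes that only full windows are kept (no edge padding), so the exact frame count may differ from the MATLAB implementation; the zero arrays merely stand in for real gammatone spectrograms.

```python
import numpy as np

SAMPLE_RATE = 48_000
WINDOW_S, HOP_S = 0.080, 0.020
NUM_BANDS = 32
EXCERPT_SECONDS = 8

window = round(WINDOW_S * SAMPLE_RATE)   # window length in samples
hop = round(HOP_S * SAMPLE_RATE)         # hop length in samples
num_samples = EXCERPT_SECONDS * SAMPLE_RATE

# Assumption: only full windows are kept (no padding at the excerpt edges).
num_frames = 1 + (num_samples - window) // hop

# Placeholder spectrograms standing in for the real gammatone outputs.
ref_spec = np.zeros((num_frames, NUM_BANDS), dtype=np.float32)
deg_spec = np.zeros((num_frames, NUM_BANDS), dtype=np.float32)

# Stack channel-wise: [channels, time frames, frequency bands].
model_input = np.stack([ref_spec, deg_spec], axis=0)
print(window, hop, model_input.shape)
```

Under these assumptions the window spans 3840 samples, the hop 960 samples, and each reference-degraded pair yields an input of shape (2, 397, 32).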
Architecture
Existing deep learning architectures for speech and audio quality assessment
generally consist of CNN blocks, RNN blocks or attention layers. The
model proposed by the Microsoft team consists of several convolutional layers,
batch normalization layers, max pooling layers and fully connected layers with
dropout. Other models, such as the attentional Siamese neural networks proposed
by Gabriel Mittag and Sebastian Moeller, add LSTM layers and attention
layers to capture the influence of features over long time sequences.
Self-attention was proposed by Google in 2017 for natural language
processing without RNNs. The intuition behind the attention mechanism is that
when human sight or hearing detects an item, it does not scan the entire scene
or excerpt from end to end; rather, it focuses on a specific portion according
to its needs. The attention mechanism is designed to dynamically create a weight
matrix between keys and queries. This weight matrix can be applied to the
feature maps or to the original input, either spatial-wise or channel-wise.
Interesting and promising applications of the attention mechanism in computer
vision include fine-grained classification, image segmentation and image
captioning. Compared with conventional classification implemented by a CNN, an
attention module adds a parallel branch consisting of successive down-sampling
and up-sampling operations to gain a wider receptive field. The attention map
increases the range of the receptive field of the lower layers and highlights
the core features that are crucial to the classification task.
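The weight matrix between keys and queries can be illustrated with scaled dot-product self-attention. This is a generic NumPy sketch, not the architecture proposed here: the dimensions, the random projection matrices and the function names are illustrative assumptions.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a [positions, features] input."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])   # query-key similarity
    weights = softmax(scores, axis=-1)        # one weight row per query
    return weights @ v, weights

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 6, 32, 16             # toy sizes (assumed)
x = rng.standard_normal((seq_len, d_model))
w_q, w_k, w_v = (rng.standard_normal((d_model, d_k)) for _ in range(3))

out, weights = self_attention(x, w_q, w_k, w_v)
print(out.shape, weights.shape)
```

Each row of `weights` sums to one and determines how strongly a position attends to every other position, which is the dynamically computed weighting described above.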
In addition to conventional convolutional layers, attention layers and
the squeeze-and-excitation network (SENet) will be explored in our
model. While ordinary self-attention layers are applied spatial-wise, SENet
is a special attention mechanism that applies different weights channel-wise.
The appropriate design and parameters of the architecture remain to be
discussed and tested in further work.
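The channel-wise weighting of a squeeze-and-excitation block can be sketched as below. Since the architecture is still open, this is only an illustration under assumptions: the channel count, the reduction ratio, the random weights and the function name are hypothetical, and biases are omitted for brevity.

```python
import numpy as np

def squeeze_excite(features: np.ndarray, w1: np.ndarray, w2: np.ndarray):
    """Apply channel-wise gates to a [channels, time, frequency] feature map."""
    squeezed = features.mean(axis=(1, 2))            # squeeze: one value per channel
    hidden = np.maximum(0.0, w1 @ squeezed)          # bottleneck layer with ReLU
    gates = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))     # excitation: sigmoid gates
    return features * gates[:, None, None], gates

rng = np.random.default_rng(0)
channels, reduction = 16, 4                          # toy sizes (assumed)
features = rng.standard_normal((channels, 20, 32))
w1 = rng.standard_normal((channels // reduction, channels))
w2 = rng.standard_normal((channels, channels // reduction))

scaled, gates = squeeze_excite(features, w1, w2)
print(scaled.shape, gates.shape)
```

Every channel is multiplied by a single learned gate between 0 and 1, in contrast to spatial attention, which weights individual time-frequency positions.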
Conclusion
Although state-of-the-art methods have proposed a few intrusive deep learning
models that learn from waveforms, spectrograms or other transformed features,
most of these models were trained on 16 kHz speech signals and none of them
uses gammatone spectrograms as input. To our knowledge, our model is the first
end-to-end neural network that predicts MOS from gammatone spectrograms derived
from a 48 kHz audio dataset. Perceptual audio quality assessment is still a
new and promising application of deep learning, and we expect this work to be
broadly applicable.
References
1. Michael Chinen, Felicia S. C. Lim, Jan Skoglund, Nikita Gureev, Feargus
O’Gorman and Andrew Hines, “ViSQOL v3: an open source production
ready objective speech and audio metric”, arXiv:2004.09584v1 [eess.AS],
20 Apr. 2020
2. Colm Sloan, Naomi Harte, Damien Kelly, Anil C. Kokaram and Andrew
Hines, “Objective assessment of perceptual audio quality using ViSQOLAudio”,
IEEE Transactions on Broadcasting, Vol. 63, No. 4, Dec. 2017
3. Hannes Gamper, Chandan K. A. Reddy, Ross Cutler, Ivan J. Tashev and
Johannes Gehrke, “Intrusive and non-intrusive perceptual speech quality
assessment using a convolutional neural network”, 2019 IEEE Workshop
on Applications of Signal Processing to Audio and Acoustics
4. Gabriel Mittag and Sebastian Moeller, “Full-reference speech quality
estimation with attentional siamese neural networks”, ICASSP 2020, IEEE, 2020