Index
Dynamic Technology trend monitoring from unstructured data using Machine learning
New technologies are enablers for product and process innovations. However, given the multitude of technologies available on the market, identifying the technologies that are new and relevant for one’s own company and one’s own problem requires considerable effort. ROKIN, as a technology platform, offers a key component for the rapid identification of new technologies and thus for the acceleration of innovation processes in companies. For this purpose, new technologies are identified on the Internet, profiles are created, and these are made available to companies via an online platform. Companies receive suitable solution proposals for their specific problems.
ROKIN automates the individual steps of this process, from data collection via web crawler, through the matching process, to the visualization of information in technology profiles. A central point in this process is detecting the newest technological trends in the market from the collected data. This allows companies to keep up with upcoming technological shifts.
Recent successes with so-called “Transformer models” (e.g. “Bidirectional Encoder Representations from Transformers”, BERT) are opening up new possibilities in the recognition and understanding of texts like never before. These models were trained domain-independently on general corpora such as Wikipedia and BookCorpus. An open question is how these approaches perform in a domain-specific context like engineering. Can the semantic understanding of such algorithms be used to improve existing classical NLP keyword analysis and topic modelling for trend detection? Especially at the early onset of a trend, where keywords provide little evidence, the semantic understanding of transformer-based approaches might help. The goal is therefore to implement and extend existing classical NLP algorithms with Transformer models and to use the new model to identify trends in large amounts of engineering text documents.
Tasks:
• Literature research and analysis of existing NLP tools for trend detection (transformers as well as classic keyword analysis and topic modelling approaches).
• Setting up an information database (via Web-Crawling and Google Search APIs) for a given problem out of the engineering environment of a company (topic provided by ROKIN).
• Semantic modelling and analysis of the information database for identifying technology trends by different approaches of NLP algorithms.
• Evaluation of the strengths and weaknesses of the created algorithms based on their individual results.
• Development of an approach for an ideal trend detection strategy, specific to early-stage trend detection.
• Evaluation and optimization of the algorithms and documentation of the results.
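As a starting point for the classical keyword-analysis baseline mentioned in the tasks, a trend score can be as simple as the relative growth of keyword counts across time buckets. The corpus, the keywords, and the smoothing in the scoring formula below are illustrative assumptions, not part of the actual ROKIN pipeline:

```python
from collections import Counter

def keyword_trend_scores(docs_by_period, keywords):
    """Count keyword mentions per time period and report the relative
    growth of each keyword between the first and last period."""
    counts = []
    for docs in docs_by_period:
        c = Counter()
        for doc in docs:
            for token in doc.lower().split():
                if token in keywords:
                    c[token] += 1
        counts.append(c)
    scores = {}
    for kw in keywords:
        first, last = counts[0][kw], counts[-1][kw]
        scores[kw] = (last - first) / (first + 1)  # +1 smoothing for unseen terms
    return scores

# Hypothetical mini-corpus: documents bucketed into two time periods
periods = [
    ["laser welding of steel", "classic cnc milling"],
    ["additive manufacturing with lasers", "additive manufacturing trends", "laser safety"],
]
print(keyword_trend_scores(periods, {"laser", "additive"}))
```

A transformer-based variant would replace the exact-match counting with similarity in an embedding space, which is where early, weakly-keyworded trends could become visible.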
Machine-Learning-Based Status Monitoring of HVDC Converter Stations
Detection and semantic segmentation of human faces in low resolution thermal images
The detection and isolation of persons with elevated body core temperature contribute to reducing the speed with which certain respiratory diseases spread through the population. Contactless temperature measurements with thermal cameras are used for fast screening of persons and for selecting those that should be checked more closely with accurate medical thermometers. In public areas, the only accessible source of temperature information is typically the face and its exposed skin segments. The offset between a person’s actual body core temperature and the skin temperature varies over a wide range. It depends on the ambient conditions, on what the person did in the last few minutes and where they came from, on the person’s body characteristics and, of course, on the location of the observed skin segment, not to speak of technical limitations of the camera itself. Currently, Bosch Sicherheitssysteme Engineering GmbH is investigating the dependency of the body core temperature offset on the location of the measured skin segment.
In this master thesis, a reasonable detection and semantic segmentation of the human face on a thermal image should be investigated. In order to do this, the following points shall be addressed:
– Literature research for state-of-the-art methods of face detection within thermal images
– Identification of the most effective method for exact face position detection within a preselected image area, including prototypical implementation (e.g. with OpenCV)
– Preparation and annotation of thermal image data for the usage of face detection
– Comparison of a neural network based method for face detection and classical machine learning approaches for the application to low resolution thermal images, eventually including prototypical implementation
– Identification of the most promising methods to correlate a hotspot pixel location with a face section (chin, cheek, nose, forehead, etc.), including prototypical implementations
– Optional: Identification of the most promising methods to detect certain facial occlusions like facial hair (forehead, beard) or glasses, including prototypical implementation
As input for the investigation, existing field test data is available for analysis, but further dedicated lab experiments will certainly be required.
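As a rough illustration of the classical (non-neural) end of the comparison, a hotspot region in a low-resolution thermal frame can already be localized by simple thresholding. The threshold value and the synthetic frame below are assumptions for the sketch, not calibrated camera data:

```python
import numpy as np

def hottest_region(thermal, threshold_c=34.0):
    """Return the bounding box (r0, r1, c0, c1) of pixels above a skin-
    temperature threshold -- a crude classical stand-in for face detection
    in a low-resolution thermal image. Returns None if nothing is warm."""
    mask = thermal >= threshold_c
    if not mask.any():
        return None
    rows = np.where(mask.any(axis=1))[0]
    cols = np.where(mask.any(axis=0))[0]
    return int(rows[0]), int(rows[-1]), int(cols[0]), int(cols[-1])

# Synthetic 12x16 "thermal frame": 22 C background with a warm 4x4 face patch
frame = np.full((12, 16), 22.0)
frame[3:7, 5:9] = 35.5
print(hottest_region(frame))  # → (3, 6, 5, 8)
```

A learned detector would replace the fixed threshold, but such a baseline is useful for the strengths-and-weaknesses comparison the task list asks for.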
Quality Assurance and Clinical Integration of a Prototype for Intelligent 4DCT Sequence Scanning
With 1.8 million deaths worldwide in 2018 (353,000 deaths in Europe in 2012 [1]), lung cancer is the deadliest cancer [2]. The prognosis for lung cancer is quite poor: only 15% of men (21% of women) survive 5 years [3].
75% of these patients receive radiation therapy [4]. However, radiation therapy is challenged by breathing-related movements, which lead to artifacts possibly causing both incorrect diagnosis and dosimetric errors of the therapy itself. As a result, the target volume might not be covered by the scheduled amount of radiation.
Computed tomography (CT) is an essential part of the treatment planning process. While 3D CT images can correctly display static anatomy, 4D imaging additionally records the breathing cycles and retrospectively synchronizes them with the acquired images. Thus, the result of a 4D CT scan is time-resolved data of a 3D volume.
4D CT imaging with fixed beam on/off slots and irregular breathing can lead to missing data coverage in the desired breathing states, known as a violation of the data sufficiency condition (DSC) [5]. The resulting artifacts appear in the image as strong blurring of anatomical structures and, in the worst case, require a second treatment planning CT, causing a delay of patient treatment as well as additional dose.
The idea of the intelligent 4D CT (i4DCT)-algorithm is to improve data coverage to reduce these artifacts. During the initial learning period the patient-specific respiratory cycle is analyzed. For every slice the scanner generates data for a whole respiratory cycle. Based on an online comparison of reference and current breathing curves during data acquisition, the selection of beam on/off periods is adjusted. If the data sufficiency condition is fulfilled the scan is stopped and the table moves to the next z-position. This process is repeated until the targeted scan area is covered. [5]
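The beam on/off logic described above can be sketched as follows. The normalized breathing amplitudes and the simple tolerance band are illustrative assumptions, not the actual i4DCT implementation:

```python
def gate_beam(reference_cycle, live_samples, tolerance=0.1):
    """Toggle 'beam on' whenever the live breathing amplitude stays within
    a tolerance band around the patient-specific reference cycle learned
    during the initial period. Returns one boolean per live sample.
    Simplified illustration only."""
    n = len(reference_cycle)
    decisions = []
    for i, amplitude in enumerate(live_samples):
        expected = reference_cycle[i % n]  # wrap around the learned cycle
        decisions.append(abs(amplitude - expected) <= tolerance)
    return decisions

# Learned cycle (normalized amplitude) and a live curve with one irregularity
ref = [0.0, 0.5, 1.0, 0.5]
live = [0.05, 0.48, 1.6, 0.5, 0.02]   # third sample: deep, irregular breath
print(gate_beam(ref, live))  # → [True, True, False, True, True]
```

In the real algorithm the comparison runs online during acquisition and additionally checks the data sufficiency condition before moving to the next z-position.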
To ensure the effective, safe and reliable use of the i4DCT algorithm in everyday clinical practice, quality assurance must be provided.
The aim of this Master’s thesis is to develop and perform quality tests. Subsequently, the results are evaluated and interpreted to draw conclusions for clinical application.
Phantom measurements are performed with the CIRS Motion Thorax Phantom (CIRS, Norfolk, USA). It contains a lung-equivalent solid epoxy rod with an embedded soft tissue target (representing the tumor). In order to get close to realistic circumstances, the target can be moved in three dimensions by the CIRS Motion Software according to an artificially created, irregular breathing pattern. The breathing curve is tracked by the Varian ‘respiratory gating for scanners’ system (RGSC, Varian Medical Systems, Inc., Palo Alto, CA), which consists of two main parts. All measurements are performed on a SOMATOM go Open Pro CT scanner (Siemens Healthcare, Forchheim, Germany).
The tests include different reconstruction methods (Maximum Intensity Projection and amplitude/ phase based reconstruction), investigating the dimensions of the artificial tumor in every body axis, verifying the match of recorded breathing pattern in RGSC and CT as well as testing the limits of RGSC/ i4DCT algorithm.
References
[1] J. Ferlay, E. Steliarova-Foucher, J. Lortet-Tieulent, S. Rosso, J. W. W. Coebergh, H. Comber, D. Forman and F. I. Bray, “Cancer incidence and mortality patterns in Europe: Estimates for 40 countries in 2012,” European Journal of Cancer, vol. 49, no. 6, pp. 1374-1403, 2013.
[2] World Health Organisation (WHO), “Cancer,” 2018. [Online]. Available: https://www.who.int/news-room/fact-sheets/detail/cancer. [Accessed 10 Sep 2020].
[3] Zentrum für Krebsregisterdaten, “Lungenkrebs (Bronchialkarzinom),” 17 Dec 2019. [Online]. Available: https://www.krebsdaten.de/Krebs/DE/Content/Krebsarten/Lungenkrebs/lungenkrebs_node.html. [Accessed 10 Sep 2020].
[4] R. Werner, “Strahlentherapie atmungsbewegter Tumoren: Bewegungsfeldschätzung und Dosisakkumulation anhand von 4D-Bilddaten,” Springer Vieweg, 2013, p. 1.
[5] R. Werner, T. Sentker, F. Madesta, T. Gauer and C. Hofman, “Intelligent 4D CT sequence scanning (i4DCT): Concept and performance,” Medical Physics, vol. 46, pp. 3462-3474, May 2019.
A Robust Intrusive Perceptual Audio Quality Assessment Based on a Convolutional Neural Network
Abstract
The goal of a perceptual audio quality predictor is to capture the auditory
experience of listeners and score audio excerpts without creating a
massive workload for the listeners. Methods such as PESQ and ViSQOL
serve as computational proxies for subjective listening tests. ViSQOLAudio,
the Virtual Speech Quality Objective Listener in audio mode, is
a signal-based, full-reference, intrusive metric that models human audio
quality perception using a gammatone spectro-temporal measure of
similarity between a reference and a degraded audio signal. Here we
propose an end-to-end model based on a convolutional neural network
with a self-attention mechanism to predict the perceived quality of audio
given a clean reference signal, and to improve robustness to adversarial
examples. The model is trained and evaluated on a corpus of unencoded
48 kHz audio totalling 12 hours, labelled by ViSQOLAudio
to derive a Mean Opinion Score (MOS) for each excerpt.
Keywords: perceptual audio quality assessment, MOS, ViSQOLAudio,
full reference, deep learning, self-attention, end-to-end model
Introduction
Digital audio systems and services use codecs to encode and decode a digital
data stream or signal in order to minimize bandwidth and maximize users’
quality of experience. Different codecs introduce different quality degradations
and artefacts, which affect the perceived audio quality. To evaluate
codec performance, a MOS score is obtained by asking listeners to assess the
quality of an audio clip on a scale from one to five. This method is tedious
and expensive, so several computational approaches have been designed to
predict MOS automatically. Intrusive methods, i.e. those with a full reference
signal, calculate a perceptually weighted distance between the clean (unencoded)
reference and the degraded (coded) signal. PEAQ, POLQA, PEMO-Q and
ViSQOLAudio are four major full-reference models. ViSQOLAudio, which
will be the focus and inspiration of this thesis, is an adaptation of ViSQOL
for perceptual audio quality prediction. ViSQOLAudio
introduces a series of novel improvements and has shown outstanding performance
against POLQA, PEAQ and PEMO-Q. Inspired and motivated by
ViSQOLAudio, we design an end-to-end deep learning network to predict
MOS using gammatone spectrograms as input, which resembles the algorithm
of ViSQOLAudio and improves prediction performance and robustness to
adversarial examples.

Figure 1: A representation of ViSQOLAudio
Background
The process of ViSQOLAudio consists of four phases: preprocessing, pairing,
comparison and finally the similarity measure to a MOS mapping. In the
preprocessing stage, the middle channel of reference and degraded signals is
extracted, misalignment caused by zero padding is removed and then gammatone
spectrograms are calculated on both signals. Gammatone filters are a
popular linear approximation to the filtering performed by the human auditory
system; the audio signal is visualized as a time-varying distribution of
energy in frequency, which is one way of describing the information the brain
gets from the ears via the auditory nerves. A conventional spectrogram differs
from how sound is analyzed by the ear: the ear’s frequency sub-bands get wider
at higher frequencies, whereas the usual spectrogram keeps a constant bandwidth
across all frequency channels.
The pairing step first segments spectrograms of reference signals into a
sequence of patches of size 32 frequency bands times 30 frames (i.e., a 32 x 30
matrix). Then patches of the same size are iteratively extracted from the
degraded signal to calculate reference-degraded distances and create a set
of the most similar reference-degraded patch pairs. The similarity of each pair is then
calculated in the comparison step and averaged across all the frequency bands.
In the last step the mean frequency band similarity score is mapped to MOS
using a support vector regression model.
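The pairing step can be illustrated with a minimal sketch: slide a 32 x 30 window over the degraded spectrogram and keep the offset with the smallest L2 distance to a given reference patch. The real ViSQOLAudio pairing differs in details (distance measure, patch selection), so this is an assumption-laden illustration only:

```python
import numpy as np

def best_matching_patch(ref_patch, deg_spec, frames=30):
    """Slide a window over the degraded spectrogram and return the start
    frame of the patch most similar (smallest L2 distance) to ref_patch.
    Mirrors the patch-pairing idea, not the exact ViSQOLAudio code."""
    best_t, best_d = 0, np.inf
    for t in range(deg_spec.shape[1] - frames + 1):
        d = np.linalg.norm(deg_spec[:, t:t + frames] - ref_patch)
        if d < best_d:
            best_t, best_d = t, d
    return best_t

rng = np.random.default_rng(0)
deg = rng.normal(size=(32, 90))                              # 32 bands x 90 frames
ref = deg[:, 40:70] + rng.normal(scale=0.01, size=(32, 30))  # slightly degraded copy
print(best_matching_patch(ref, deg))  # → 40
```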
Dataset
The dataset used by the Microsoft team for full-reference speech quality evaluation
consists of 2010 clean speech samples, sampled at 16 kHz, each up to 20 seconds
long with 3 utterances, approximately 33 hours in total. The speech data for the
attentional Siamese neural networks were collected from 11 different databases
from the POLQA pool, with 5000 clean reference signals totalling up to 16 hours.
Building a dataset of between 10 and 30 hours would therefore be adequate as
well as efficient for unbiased computation in our case.
We collected 48 kHz sampled mono audio files to build our clean reference
dataset, which consists of 4500 music excerpts and 900 speech excerpts;
each excerpt is exactly 8 seconds long, adding up to 12 hours in total. The
reference audio clips are then encoded and decoded with HE-AAC at
bitrates of 16, 20, 24, 32, and 48 kbps and with plain AAC at 64, 96, and
128 kbps. Coding above 128 kbps is hardly audibly different from the
uncoded signal, and coding below 16 kbps greatly reduces the audio quality,
making little sense in common practical applications. In this way, 43,200
degraded signals are generated from the 5400 clean reference signals, expected
to fall ideally into 8 different quality intervals corresponding to the coded
bitrates.
The reference and degraded signals are then paired and aligned and later
fed into ViSQOLAudio to obtain MOS values as ground-truth labels in
place of human-annotated MOS scores. Gammatone spectrograms
of the reference and degraded signals are extracted with the MATLAB
implementation of the gammatone spectrogram by Daniel Ellis, which
runs inside ViSQOLAudio. The gammatone
spectrogram of the audio signal is calculated with a window size of 80ms,
hop size of 20ms, and 32 frequency bands from 50Hz up to half of the sample
rate. The gammatone spectrograms of reference and degraded signals are
paired and concatenated channel-wise in the shape of [channels, time frames,
frequency bands] and later used as inputs to our neural network.
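The resulting input tensor shape can be sanity-checked from the parameters above. The round-based framing convention below is an assumption (the exact frame count depends on the implementation's padding and rounding):

```python
import numpy as np

def spectrogram_frames(duration_s, win_s=0.08, hop_s=0.02):
    """Number of analysis frames for a clip of given length (seconds),
    assuming no padding: one frame per hop once the first window fits."""
    return round((duration_s - win_s) / hop_s) + 1

frames = spectrogram_frames(8.0)          # 8-second excerpt
ref = np.zeros((frames, 32))              # [time frames, frequency bands]
deg = np.zeros((frames, 32))
net_input = np.stack([ref, deg])          # channel-wise pairing of ref and deg
print(net_input.shape)  # → (2, 397, 32)
```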
Architecture
The existing deep learning architectures in speech and audio quality assessment
generally consist of CNN blocks, RNN blocks or attention layers. The
model proposed by the Microsoft team consists of several convolutional layers,
batch normalization layers, max pooling layers and fully connected layers with
dropout. Other models, such as the attentional Siamese neural networks proposed
by Gabriel Mittag and Sebastian Moeller, add LSTM layers and attention
layers to capture the influence of features over long time sequences.
Self-attention was proposed by Google in 2017 for natural language
processing without RNNs. The essence of the attention mechanism is that when
human sight or hearing detects an item, it does not scan the entire scene or
excerpt end to end; rather, it focuses on a specific portion according to current
needs. The attention mechanism was designed to dynamically create a weight
matrix between keys and queries. This weight matrix can be applied to the
feature maps or the original input, spatial-wise or channel-wise. Interesting and
promising applications of the attention mechanism in computer vision include
fine-grained classification, image segmentation and image captioning. Compared
to conventional classification networks built from CNNs, an attention module adds
a parallel branch consisting of successive down-sampling and up-sampling
operations to gain a wider receptive field. The attention map increases the
range of the receptive field from the lower layers and highlights the core features
that are crucial to classification tasks.
Apart from conventional convolutional layers, attention layers as well
as squeeze-and-excitation networks (SENet) will be explored and utilized in our
model. While normal self-attention layers are applied spatial-wise, SENet
is a special attention mechanism that applies different weights channel-wise.
The appropriate design and parameters of the architecture remain to be
determined and tested in further work.
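A squeeze-and-excitation block can be sketched in plain NumPy as follows. The layer sizes, the reduction ratio and the random weights are illustrative, not the final architecture:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_block(feature_map, w1, w2):
    """Squeeze-and-excitation: global-average-pool each channel, pass the
    channel descriptor through a small bottleneck MLP, and rescale the
    channels by the resulting sigmoid gates.
    feature_map: [channels, height, width]."""
    squeezed = feature_map.mean(axis=(1, 2))                 # squeeze: [C]
    excited = sigmoid(w2 @ np.maximum(w1 @ squeezed, 0.0))   # excitation: [C]
    return feature_map * excited[:, None, None]              # channel-wise rescale

rng = np.random.default_rng(1)
x = rng.normal(size=(8, 4, 4))       # 8 channels of 4x4 features
w1 = rng.normal(size=(2, 8))         # reduction ratio 4: 8 -> 2
w2 = rng.normal(size=(8, 2))         # 2 -> 8
y = se_block(x, w1, w2)
print(y.shape)  # → (8, 4, 4)
```

Since the gates lie in (0, 1), the block can only attenuate channels relative to one another, which is exactly the channel-wise reweighting described above.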
Conclusion
Although state-of-the-art methods have proposed a few intrusive deep learning
models learning from waveform, spectrogram or other transformed features,
most of models were trained on 16kHz speech signals and none of those
use gammatone spectrograms as input. Our model is the first end-to-end
neural network trained on the gammatone spectrograms derived from 48kHz
audio dataset predicting MOS. Perceptual audio quality assessment is still a
brand new and promising application of deep learning algorithms and the
versatility and impact of this work is huge.
References
1. Michael Chinen, Felicia S. C. Lim, Jan Skoglund, Nikita Gureev, Feargus
O’Gorman and Andrew Hines, “ViSQOL v3: an open source production
ready objective speech and audio metric”, arXiv:2004.09584v1 [eess.AS],
20 Apr. 2020.
2. Colm Sloan, Naomi Harte, Damien Kelly, Anil C. Kokaram and Andrew
Hines, “Objective assessment of perceptual audio quality using ViSQOLAudio”,
IEEE Transactions on Broadcasting, vol. 63, no. 4, Dec. 2017.
3. Hannes Gamper, Chandan K. A. Reddy, Ross Cutler, Ivan J. Tashev, and
Johannes Gehrke, “Intrusive and non-intrusive perceptual speech quality
assessment using a convolutional neural network”, 2019 IEEE Workshop
on Applications of Signal Processing to Audio and Acoustics.
4. Gabriel Mittag and Sebastian Moeller, “Full-reference speech quality
estimation with attentional Siamese neural networks”, ICASSP 2020, IEEE.
Classification of Breast Density in Mammograms Using Deep Machine Learning
The female breast is mainly composed of adipose and fibroglandular tissue. In a mammogram, fibroglandular tissue appears brighter than fatty tissue and is therefore called “dense”. Current clinical protocol requires radiologists to not only detect possible cancer tumors but also to evaluate breast density in a mammogram [Wockel 2018], which corresponds to the relative amount of fibroglandular tissue. Breast density is an important characteristic of a mammogram because it is a breast cancer risk marker and it affects the mammogram’s sensitivity. The evaluation is done via classification into one of the four categories defined by the “Breast Imaging – Reporting and Data System” guidelines from the American College of Radiology (ACR BI-RADS).
In this thesis, the application of convolutional neural networks for the classification of breast density in mammograms is investigated. Several neural network architectures and training methods are tested and the results compared against classical machine learning methods. A strategy for the removal of possibly noisy labels in the training data is presented and an analysis of inter-observer variability among radiologists is carried out. It is found that the algorithm with the best classification performance provides breast density assessment on level with an average experienced radiologist.
COPD Classification in CT Images Using a 3D Convolutional Neural Network
Chronic obstructive pulmonary disease (COPD) is a lung disease that is not fully reversible and one of the leading causes of morbidity and mortality in the world. Early detection and diagnosis of COPD can increase the survival rate and reduce the risk of COPD progression in patients. Currently, the primary examination tool to diagnose COPD is spirometry. However, computed tomography (CT) is used for detecting symptoms and for sub-type classification of COPD. Interpreting the different imaging modalities is a difficult and tedious task even for physicians and is subject to inter- and intra-observer variation. Hence, developing methods that can automatically classify COPD versus healthy patients is of great interest. In this thesis, we propose a 3D deep learning approach to classify COPD and emphysema using volume-wise annotations only. We also investigate the impact of transfer learning on the classification of emphysema using knowledge transfer from a pre-trained COPD classification model.
Tumor Detection & Classification in Breast Cancer Histology Images using Deep Neural Networks
Among females, breast cancer is one of the most frequently diagnosed cancers and one of the leading causes of cancer-related death, both worldwide and in more economically developed countries. Early diagnosis significantly increases treatment success, since treatment is more difficult and uncertain when the disease is detected at advanced stages. For this purpose, proper analysis of histology images is essential. Histology is the study of the microanatomy of cells, tissues, and organs as seen through a microscope.
One of the most common types of histology images, used as the basis of contemporary cancer diagnosis for at least a century, is Hematoxylin and eosin (H&E) stained breast histology microscopy images [4]. During this diagnosis procedure, trained specialists evaluate both the overall and the local tissue organization of the images. However, due to the large amount of data and the complexity of the images, this task becomes very time-consuming and non-viable. Therefore, the development of software tools for automatic detection and diagnosis is a promising prospect in this field. This subject has been a rather active field of research, and the automatic detection of breast cancer based on histology images is part of the ICIAR 2018 BreAst Cancer Histology (BACH) challenge. This challenge consists of two parts: classification and segmentation.
The aim of this thesis is first to design a classifier network that can recognize types of breast cancer. Then, using another network, we will try to classify whole-slide images (WSIs) and perform segmentation on them. Afterwards, we want to investigate how weakly-supervised training affects our results on both image-wise (first part) and pixel-wise (second part) labeled images. For this purpose, we will start by reproducing the results of the winning paper, which is the state of the art, and then build the rest on top of that.
Deep Learning-based Denoising of Mammographic Images using Physics-driven Data Augmentation
Mammography uses low-energy X-rays to screen the human breast and is used by radiologists to detect breast cancer. Due to the complexity of the images, radiologists need impeccable image quality. For this reason, the possibility of using deep learning to denoise mammograms, helping radiologists detect breast cancer more easily, will be examined. In this thesis, we aim to investigate and develop different deep learning methods for mammogram denoising.
A physically motivated noise model will be simulated on the ground-truth images to generate training data. Thereafter, the variance-stabilizing Anscombe transformation is applied so that the noise becomes approximately white Gaussian noise. Using these data, different network architectures are trained and examined. For training, a novel loss function will be designed which helps to preserve fine image details crucial for breast cancer detection.
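The Anscombe step can be sketched as follows. The Poisson rate and the simple algebraic inverse are illustrative (in practice an unbiased inverse would be used):

```python
import numpy as np

def anscombe(x):
    """Anscombe transform: maps Poisson-distributed counts to data with
    approximately unit-variance Gaussian noise, so a denoiser trained on
    white Gaussian noise can be applied."""
    return 2.0 * np.sqrt(x + 3.0 / 8.0)

def inverse_anscombe(y):
    """Simple algebraic inverse (a bias-corrected inverse differs slightly)."""
    return (y / 2.0) ** 2 - 3.0 / 8.0

rng = np.random.default_rng(2)
counts = rng.poisson(lam=50.0, size=100_000)
print(round(anscombe(counts).var(), 3))  # close to 1.0: variance is stabilized
```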
The effectiveness of this loss function is investigated, and its performance is compared against other state-of-the-art loss functions. It can be shown that the proposed method outperforms state-of-the-art algorithms such as BM3D for mammography denoising. Finally, it will be shown that the network is able to remove not only simulated but also real noise.
Solution to extend the Field of View of Computed Tomography using Deep Learning approaches
Deep learning has been successfully applied in various applications of computed tomography (CT). Due to limited
detector size and low dose requirements, the problem of data truncation is essentially present in CT. The reconstructed images from such limited field-of-view (FoV) projections suffer from cupping artifacts inside the FoV and distortion or loss of anatomical structures outside the FoV [1]. One practical approach to the data truncation problem is to apply an extrapolation technique that increases the FoV, and then apply an artifact removal technique. The water cylinder extrapolation based reconstruction [2] is a promising method that estimates the projections outside the scan field-of-view (SFoV) using knowledge from the projections inside the SFoV. Alternatively, linear extrapolation is the simplest technique that always increases the FoV without using any prior information; however, artifacts remain visible in the reconstructed image. Recently, Fournié et al. [3] have proposed a deep learning based method, “Deep EFoV”, to extend the FoV of CT images. First, the FoV is increased by linearly extrapolating the outer channels in sinogram space. The image reconstructed from this extended-FoV sinogram still contains artifacts, so a U-net model is then used to remove them. The output of a neural network model might, however, alter anatomical structures inside the SFoV. To compensate for this effect, a standard algorithm, “HDFoV”, is used, in which the projections inside the SFoV are merged with the projections from the neural network model outside the FoV.
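The linear sinogram extrapolation that precedes the artifact-removal step can be sketched as follows. The padding width and the linear ramp down to zero are assumptions for illustration, not the exact “Deep EFoV” preprocessing:

```python
import numpy as np

def extend_sinogram(sino, pad):
    """Linearly extrapolate the outer detector channels of each projection
    row down to zero over `pad` extra channels on both sides -- the simple
    FoV-extension step applied before artifact removal.
    sino: [views, channels]."""
    views, channels = sino.shape
    out = np.zeros((views, channels + 2 * pad))
    out[:, pad:pad + channels] = sino
    ramp = np.linspace(0.0, 1.0, pad, endpoint=False)    # 0 .. just below 1
    out[:, :pad] = sino[:, :1] * ramp                    # ramp up to left edge value
    out[:, pad + channels:] = sino[:, -1:] * ramp[::-1]  # ramp down from right edge
    return out

sino = np.ones((4, 6))             # toy sinogram: 4 views, 6 channels
ext = extend_sinogram(sino, pad=3)
print(ext.shape)  # → (4, 12)
print(ext[0])     # edges ramp from 0 up to the boundary value and back down
```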
The aim of the master’s thesis will be to integrate “Deep EFoV” and “HDFoV” algorithms in the C#-based proprietary
reconstruction tool “ReconCT” developed by Siemens Healthineers. The result from the integrated algorithms needs
to be compared with the result from the “Deep EFoV” algorithm alone. Another goal is to evaluate and improve the
proposed deep learning model in “Deep EFoV” for CT FoV extension. The model is to be improved with respect to
architecture tweaks, parameter adaptation, or even a different architecture. The dataset and software provided
by Siemens Healthineers will be used in the thesis. The final software needs to be integrated into the “ReconCT” and
has to be presented to the supervisors.
The thesis will include the following points:
• Review of the state-of-the-art method and deep learning approaches to extend the FoV
• Comparison of the proposed method “Deep EFoV” with the integrated “Deep EFoV” and “HDFoV” method
• Improvement and simplification of the proposed deep learning model in “Deep EFoV”
• Integration of the proposed model in the reconstruction tool.
References
[1] Y. Huang, L. Gao, A. Preuhs, and A. Maier, “Field of View Extension in Computed Tomography Using Deep
Learning Prior,” in Bildverarbeitung für die Medizin: Algorithmen – Systeme – Anwendungen, pp. 186–191,
Springer, 2020.
[2] J. Hsieh, E. Chao, J. Thibault, B. Grekowicz, A. Horst, S. McOlash, and T. J. Myers, “A novel reconstruction
algorithm to extend the CT scan field-of-view,” Medical Physics, vol. 31, no. 9, pp. 2385–2391, 2004.
[3] É. Fournié, M. Baer-Beck, and K. Stierstorfer, “CT field of view extension using combined channels extension
and deep learning methods,” in International Conference on Medical Imaging with Deep Learning – Extended
Abstract Track, London, United Kingdom, 08–10 Jul 2019.