Investigating the class-imbalance problem using deep learning techniques on real industry printed circuit board data

Quality control tasks in industry are well suited to machine learning because they produce large
volumes of machine-generated data. However, some of the collected data is heavily imbalanced or
even unlabelled, since labelling is labour- and cost-intensive. The objective of this work is to
investigate and apply deep learning methods that address this problem. For this purpose, a real
industry printed circuit board data set provided by the Continental corporation will be used.
The first part of this work is a literature review of available methods for overcoming the
class-imbalance problem. The review focuses on three groups of methods. The first group deals with
synthetic over-sampling, which generates additional data points within the convex hull of the
underrepresented classes. A concrete approach of this kind is Polarity-GAN, proposed in [1]. The
second group covers semi-supervised learning approaches, which make use of unlabelled data. The
starting point here is the work of Hyun et al. [2], who studied semi-supervised deep learning methods
under class imbalance. The third group comprises classical deep learning methods that address the
class-imbalance problem, such as the Focal Loss [3].
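
To illustrate the third group, the following is a minimal plain-Python sketch of the binary focal loss, FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t), as defined in [3]. It is meant only to show how the loss down-weights well-classified samples. The function names are hypothetical, the defaults gamma = 2 and alpha = 0.25 are the values reported in [3] and not necessarily those used in this work, and an actual implementation would operate on PyTorch tensors rather than single floats.

```python
import math


def cross_entropy(p, y):
    """Standard binary cross-entropy for one sample.

    p: predicted probability of the positive class (0 < p < 1)
    y: ground-truth label, 1 (positive) or 0 (negative)
    """
    p_t = p if y == 1 else 1.0 - p  # probability assigned to the true class
    return -math.log(p_t)


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for one sample (sketch of the formula in [3]).

    gamma: focusing parameter; gamma = 0 removes the modulating factor
    alpha: weight of the positive class; the negative class gets 1 - alpha
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    # (1 - p_t) ** gamma shrinks the loss of samples that are already
    # classified with high confidence, so hard samples dominate the total.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


if __name__ == "__main__":
    for p in (0.9, 0.5, 0.1):  # easy, ambiguous, and hard positive sample
        print(p, cross_entropy(p, 1), focal_loss(p, 1))
```

With these defaults, a confidently correct positive sample (p = 0.9) keeps only 0.25 * 0.1^2 = 0.0025 of its cross-entropy loss, whereas a badly misclassified one (p = 0.1) keeps 0.25 * 0.9^2 = 0.2025.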
After the review, the investigated methods will be implemented and applied to the real industry
use case. To ensure reproducibility across all experiments, the data pre-processing and sampling
will be fixed. The baseline against which all experimental results are compared will be a ResNet50
architecture [4]. With this fixed setup and baseline, the performance of all reviewed methods will
be evaluated on the real industry data. In addition, ways of combining the methods will be explored
with the aim of making the classification performance more robust.
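
One way to fix the sampling across experiments is a seeded, stratified train/test split. The sketch below is an assumption about how this could be done, not the procedure used in this work. The function name, the 20 % test fraction, the seed, and the label names "ok" and "defect" are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=42):
    """Split sample indices into train and test sets, class by class.

    A private random generator with a fixed seed makes the split identical
    on every run, and splitting per class keeps the class ratio roughly the
    same in both subsets.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, test = [], []
    for label in sorted(by_class):  # sorted for a deterministic class order
        indices = by_class[label]
        rng.shuffle(indices)
        # Reserve at least one test sample per class, even for rare classes.
        n_test = max(1, round(len(indices) * test_fraction))
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Hypothetical imbalanced label list: 95 good boards, 5 defective ones.
labels = ["ok"] * 95 + ["defect"] * 5
train_idx, test_idx = stratified_split(labels)
```

Because the split depends only on the label list and the seed, every method under comparison sees exactly the same training and test samples.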
The thesis consists of the following milestones:
• Review the literature to identify methods that address the class-imbalance problem
• Implement a fixed machine learning pipeline to ensure reproducible experiments
• Apply the identified methods within the fixed pipeline and evaluate their performance
• Compare the performance against a ResNet50 baseline
The implementation will be done in Python with the help of PyTorch [5].
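
The evaluation milestones above do not name a metric. Since plain accuracy can look high on imbalanced data even when the minority class is never detected, one common option is balanced accuracy, the mean of the per-class recalls. The plain-Python sketch below shows it as an assumption, not as the metric chosen in this work. The function names and the label names are hypothetical.

```python
from collections import Counter


def per_class_recall(y_true, y_pred):
    """Return the fraction of correctly predicted samples for each true class."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}


def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recalls, so every class counts equally."""
    recalls = per_class_recall(y_true, y_pred)
    return sum(recalls.values()) / len(recalls)


# Hypothetical example: a classifier that always predicts the majority class.
y_true = ["ok"] * 95 + ["defect"] * 5
y_pred = ["ok"] * 100
# Plain accuracy would be 0.95 here, although no defect is ever found.
print(per_class_recall(y_true, y_pred))
print(balanced_accuracy(y_true, y_pred))
```

In this example the recall is 1.0 for "ok" and 0.0 for "defect", so the balanced accuracy is 0.5 and exposes the failure that plain accuracy hides.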

References
[1] Kumari Deepshikha and Anugunj Naman. Removing class imbalance using Polarity-GAN: An uncertainty
sampling approach, 2020.
[2] Minsung Hyun, Jisoo Jeong, and Nojun Kwak. Class-imbalanced semi-supervised learning. CoRR,
abs/2002.06815, 2020.
[3] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object
detection, 2018.
[4] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition,
2015.
[5] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen,
Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary
DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and
Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. In H. Wallach,
H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural
Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc., 2019.