Evaluation of SHViT for Volumetric Semantic Segmentation in Industrial CT Scans

Industrial computed tomography (iCT) is a widely applied tool in non-destructive testing, material analysis, quality control, and metrology. Semantic segmentation of industrial CT data plays a central role in these applications by enabling quality inspection, material differentiation and part separation [1]. While convolutional neural networks (CNNs) have traditionally performed well in segmentation tasks by capturing local structures, their limited ability to model long-range dependencies poses challenges in complex 3D datasets.

Transformer-based models have recently emerged as promising alternatives. By dividing the input into patches and using self-attention mechanisms, transformers can model global dependencies. However, early vision transformers had difficulties capturing spatial structure and learning from limited data. The Swin Transformer was one of the first models to address these issues by introducing a hierarchical structure and shifted windows, combined with an inductive bias that improves generalization on small datasets [2].
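The patch-and-attend idea described above can be illustrated with a minimal NumPy sketch. It is not the Swin or SHViT implementation: it applies plain global single-head attention to non-overlapping cubic patches, and the volume size, patch size, embedding width, and the helper names `volume_to_patches` and `self_attention` are assumptions chosen only for illustration.

```python
import numpy as np


def volume_to_patches(volume, patch):
    """Split a (D, H, W) volume into flattened, non-overlapping cubic patches."""
    d, h, w = volume.shape
    assert d % patch == 0 and h % patch == 0 and w % patch == 0
    gd, gh, gw = d // patch, h // patch, w // patch
    # Separate grid indices from within-patch indices, then group them.
    x = volume.reshape(gd, patch, gh, patch, gw, patch)
    x = x.transpose(0, 2, 4, 1, 3, 5)
    return x.reshape(gd * gh * gw, patch ** 3)


def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product attention over all tokens."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ v, weights


rng = np.random.default_rng(0)
volume = rng.standard_normal((8, 8, 8))       # toy volume (assumed size)
tokens = volume_to_patches(volume, patch=4)   # 8 tokens with 64 voxels each
dim = 16                                      # assumed embedding width
w_q, w_k, w_v = (rng.standard_normal((tokens.shape[1], dim)) * 0.1
                 for _ in range(3))
out, weights = self_attention(tokens, w_q, w_k, w_v)
```

Every row of `weights` spans all patches of the volume, which is what allows a token to draw on distant regions in a single layer.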

Despite these advances, transformers remain resource-intensive. Newer models such as the Single-Head Vision Transformer (SHViT) aim to reduce computational cost while maintaining performance. SHViT combines a memory-efficient macro design with single-head attention to reduce redundancy in the network [3].
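A rough calculation shows why the cost matters for volumetric data. The sketch below counts the entries of one global attention map for a cubic volume. The 256-voxel side length, the patch sizes, the float32 storage, and the helper name `attention_matrix_bytes` are illustrative assumptions, not properties of the dataset or of any particular model.

```python
def attention_matrix_bytes(volume_side, patch_side, bytes_per_value=4):
    """Return (token count, bytes of one tokens-by-tokens attention map)."""
    tokens = (volume_side // patch_side) ** 3
    return tokens, tokens * tokens * bytes_per_value


# Assumed cubic volume of 256 voxels per side, one head, one layer, float32.
for patch_side in (4, 8, 16):
    tokens, nbytes = attention_matrix_bytes(256, patch_side)
    print(f"patch {patch_side:>2}: {tokens:>7} tokens, "
          f"{nbytes / 2**20:.0f} MiB per attention map")
```

Under these assumptions a patch side of 4 yields 262,144 tokens and a 256 GiB attention map, while a patch side of 16 yields 4,096 tokens and 64 MiB. Because the map grows with the square of the token count, coarser patches or restricted attention are required before global attention becomes feasible in 3D.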

This thesis focuses on the implementation and evaluation of a volumetric SHViT model for 3D semantic segmentation. The model is tested on a real-world dataset of industrial CT scans of boxed shoes, which includes several segmentation tasks: separating the shoes from their surroundings and identifying individual components such as the insole, outsole, and upper [4]. As is typical for industrial CT data, the dataset is limited in size, yet its structural variability makes it an interesting benchmark for assessing model generalization. Because the dataset is class-imbalanced, the F₁-score serves as the primary evaluation metric. The network is also evaluated in terms of memory consumption and computational cost.
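To illustrate why the F₁-score suits class-imbalanced label volumes better than voxel accuracy, the following sketch computes a per-class F₁ = 2·TP / (2·TP + FP + FN) on a toy example. The two-class setup, the array sizes, and the helper name `f1_per_class` are assumptions; the averaging scheme used in the thesis is not specified here.

```python
import numpy as np


def f1_per_class(pred, target, num_classes):
    """Per-class F1 for integer label volumes; NaN if a class never occurs."""
    scores = np.full(num_classes, np.nan)
    for c in range(num_classes):
        tp = np.sum((pred == c) & (target == c))
        fp = np.sum((pred == c) & (target != c))
        fn = np.sum((pred != c) & (target == c))
        denom = 2 * tp + fp + fn
        if denom > 0:
            scores[c] = 2 * tp / denom
    return scores


# Toy volume: 2 foreground voxels out of 64 (assumed, heavily imbalanced).
target = np.zeros((4, 4, 4), dtype=int)
target[0, 0, :2] = 1
pred = np.zeros_like(target)
pred[0, 0, 0] = 1                       # only one of the two voxels is found

accuracy = np.mean(pred == target)      # 63 of 64 voxels are correct
scores = f1_per_class(pred, target, num_classes=2)
```

In this example the voxel accuracy is 63/64, although half of the foreground class is missed; the foreground F₁ of 2/3 exposes the error.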

The SHViT model will be compared to a CNN-based baseline in terms of accuracy, robustness, and computational efficiency in the context of 3D industrial segmentation. While the study aims to inform the selection of neural architectures for iCT applications, its conclusions are limited by the use of a single dataset. Nonetheless, SHViT shows potential for broader use in iCT, as it could enable the efficient application of transformer-based models to volumetric segmentation across diverse industrial datasets.

References

[1]	S. Bellens, P. Guerrero, P. Vandewalle and W. Dewulf, "Machine learning in industrial X-ray computed tomography – a review," CIRP Journal of Manufacturing Science and Technology, pp. 324–341, 2024.
[2]	Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin and B. Guo, "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows," in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, Canada, 2021.
[3]	S. Yun and Y. Ro, "SHViT: Single-Head Vision Transformer with Memory Efficient Macro Design," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5756–5767, 2024.
[4]	M. Leipert, G. Herl, J. Stebani, S. Zabler and A. Maier, "Three Step Volumetric Segmentation for Automated Shoe Fitting," e-Journal of Nondestructive Testing, vol. 28, no. 3, 2023.