Accurate multi-organ segmentation in abdominal CT scans is
essential for computer-aided diagnosis and treatment. While convolutional
neural networks (CNNs) have long been the standard approach in medical
image segmentation, transformer-based architectures have recently gained
attention due to their ability to model long-range dependencies. In this study,
we systematically benchmark three hybrid transformer-based models,
UNETR, SwinUNETR, and UNETR++, against a strong CNN baseline,
SegResNet, for volumetric multi-organ segmentation on the heterogeneous
RATIC dataset. The dataset comprises 206 annotated CT scans from 23 institutions
worldwide, covering five abdominal organs. All models were trained
and evaluated under identical preprocessing and training conditions using the
Dice Similarity Coefficient (DSC) as the primary metric. The results show
that the CNN-based SegResNet achieves the highest overall performance,
outperforming all hybrid transformer-based models across all organs. Among
the transformer-based approaches, UNETR++ delivers the most competitive
results, while UNETR converges in notably fewer
training iterations. These findings suggest that, for small- to medium-sized
heterogeneous datasets, well-optimized CNN architectures remain highly
competitive and may outperform hybrid transformer-based designs.
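
For reference, the DSC between a predicted mask P and a ground-truth mask G is 2|P ∩ G| / (|P| + |G|), computed separately for each organ. The following is a minimal NumPy sketch of that per-organ computation, not the evaluation code used in the study. It assumes predictions and annotations are integer label maps of equal shape; the function name `dice_per_label` and the label encoding (0 for background, 1..K for organs) are illustrative assumptions.

```python
import numpy as np


def dice_per_label(pred, target, labels):
    """Return {label: DSC} for two integer label maps of identical shape.

    Hypothetical helper for illustration. A label that is absent from both
    the prediction and the ground truth yields NaN, since DSC is undefined
    when both masks are empty.
    """
    pred = np.asarray(pred)
    target = np.asarray(target)
    if pred.shape != target.shape:
        raise ValueError("pred and target must have the same shape")

    scores = {}
    for label in labels:
        p = pred == label      # binary mask of the predicted organ
        t = target == label    # binary mask of the annotated organ
        denom = int(p.sum()) + int(t.sum())
        if denom == 0:
            scores[label] = float("nan")
        else:
            intersection = int(np.logical_and(p, t).sum())
            scores[label] = 2.0 * intersection / denom
    return scores


# Toy 2D example; real CT volumes would be 3D arrays, handled identically.
pred = np.array([[1, 1, 0],
                 [2, 2, 0]])
target = np.array([[1, 0, 0],
                   [2, 2, 2]])
scores = dice_per_label(pred, target, labels=[1, 2, 3])
```

Averaging the per-organ scores while ignoring NaN entries (for example with `np.nanmean`) gives one overall figure per scan; whether the study aggregates this way is not stated here.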