This project investigates the generalization capabilities of prominent self-supervised vision models, such as DINOv2, CLIP, and MoCo, when applied to image retrieval tasks across diverse visual domains.
Overview
Self-supervised learning (SSL) models are increasingly important for learning robust visual representations. However, their performance often degrades when transferring from standard natural-image datasets (such as ImageNet) to more specialized domains.
Our study systematically benchmarks these models on three distinct domains:
1. Natural Images (Baseline)
2. Scene-Centric Images (e.g., Places365)
3. Artistic Images (e.g., ArtPlaces)
We compare the models' out-of-the-box generalization and analyze how fine-tuning affects their domain adaptation, yielding insights into the stability and robustness of the learned representations for practical retrieval applications.
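As a minimal sketch of the out-of-the-box protocol (not the exact benchmark code), the evaluation can be approximated as follows: embed query and gallery images with a frozen backbone, retrieve nearest neighbors by cosine similarity, and score Recall@k. The choice of DINOv2 ViT-S/14 via torch.hub and the `recall_at_k` helper are illustrative assumptions, not the project's fixed implementation.

```python
import torch
import torch.nn.functional as F
from torchvision import transforms

device = "cuda" if torch.cuda.is_available() else "cpu"

# Frozen SSL backbone (DINOv2 ViT-S/14 here; CLIP or MoCo encoders slot in the same way).
model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14").to(device).eval()

# Standard ImageNet-style preprocessing (224 px is divisible by the 14 px patch size).
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def embed(images: torch.Tensor) -> torch.Tensor:
    """L2-normalized embeddings for a batch of preprocessed images (B, 3, 224, 224)."""
    feats = model(images.to(device))        # (B, D) global features
    return F.normalize(feats, dim=-1)

def recall_at_k(query_emb, gallery_emb, query_labels, gallery_labels, k=1):
    """Fraction of queries with at least one same-label gallery item in the top k."""
    sims = query_emb @ gallery_emb.T        # cosine similarity (embeddings are unit norm)
    topk = sims.topk(k, dim=1).indices      # (Q, k) indices of the nearest gallery items
    hits = (gallery_labels[topk] == query_labels[:, None]).any(dim=1)
    return hits.float().mean().item()
```

For the fine-tuning comparison, the natural extension is to adapt the backbone (or a projection head) on the target domain, recompute the embeddings, and re-score Recall@k against the frozen baseline.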