The Segment Anything Model (SAM), developed by Meta AI in 2023, introduced a powerful zero-shot segmentation approach that segments arbitrary objects without task-specific training. In 2024, Meta released SAM2, an upgraded version designed for video object segmentation and tracking, built on a hierarchical vision transformer and a memory mechanism that cross-attends to features from previous frames. Despite these advances, SAM2 lacks category-specific segmentation, which limits its ability to distinguish objects based on contextual understanding.
This research aims to enhance SAM2 by leveraging its automatic mask generator (AMG) and memory system to improve object recognition. The proposed method generates segmentation candidates with the AMG and uses cross-attention over their feature embeddings to measure similarity, enabling objects to be identified consistently across frames; a sketch of this matching step follows below. Additionally, small-scale fine-tuning of the image encoder and mask decoder is planned to address SAM2's difficulties with light reflection and over-segmentation.
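To make the matching step concrete, the following is a minimal sketch of embedding-based cross-frame identification: image-encoder features are pooled under each AMG mask, and an attention-style similarity between the resulting embeddings carries object identities from one frame to the next. All names here (pool_mask_embeddings, match_masks, the temperature value) are illustrative assumptions, not part of the SAM2 API, and random tensors stand in for real encoder outputs.

```python
import torch
import torch.nn.functional as F


def pool_mask_embeddings(features: torch.Tensor, masks: torch.Tensor) -> torch.Tensor:
    """Average image-encoder features under each binary mask.

    features: (C, H, W) feature map from the image encoder.
    masks:    (N, H, W) binary mask candidates from the AMG.
    Returns:  (N, C), one embedding per mask candidate.
    """
    masks = masks.float()
    emb = torch.einsum("chw,nhw->nc", features, masks)   # sum of features per mask
    area = masks.sum(dim=(1, 2)).clamp(min=1.0).unsqueeze(1)
    return emb / area                                    # mean feature per mask


def match_masks(emb_prev: torch.Tensor, emb_curr: torch.Tensor,
                temperature: float = 0.07) -> torch.Tensor:
    """Attention-style similarity between mask embeddings of two frames.

    Returns, for each current-frame mask, the index of the most similar
    previous-frame mask, so object identities can be propagated forward.
    """
    q = F.normalize(emb_curr, dim=-1)   # queries: current-frame masks
    k = F.normalize(emb_prev, dim=-1)   # keys: previous-frame masks
    attn = torch.softmax(q @ k.T / temperature, dim=-1)  # (N_curr, N_prev)
    return attn.argmax(dim=-1)


# Toy usage: random tensors stand in for real encoder features and AMG masks.
feats_t0, feats_t1 = torch.randn(256, 64, 64), torch.randn(256, 64, 64)
masks_t0 = torch.rand(5, 64, 64) > 0.5
masks_t1 = torch.rand(4, 64, 64) > 0.5
ids = match_masks(pool_mask_embeddings(feats_t0, masks_t0),
                  pool_mask_embeddings(feats_t1, masks_t1))
print(ids)  # previous-frame identity assigned to each current-frame mask
```

The softmax-weighted dot product mirrors the cross-attention comparison described above; in practice the argmax assignment could be replaced by a learned matching head or a Hungarian assignment.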
The study will include an analysis of SAM2's architecture, the development of an embedding-based object identification method, and performance benchmarking on the ARMBench dataset and a custom industrial dataset (a scoring sketch follows below). The findings are expected to improve SAM2's capability in complex environments, particularly in robotic applications that require precise object classification and tracking.
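For the planned benchmarking, a standard per-mask intersection-over-union score could be computed as sketched below, assuming binary ground-truth masks are available for each image. The helper names and the greedy best-match strategy are assumptions for illustration, not the study's final evaluation protocol.

```python
import numpy as np


def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection-over-union between two binary masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter) / float(union) if union > 0 else 0.0


def mean_best_iou(preds: list[np.ndarray], gts: list[np.ndarray]) -> float:
    """Score each ground-truth object by its best-matching predicted mask."""
    scores = [max(mask_iou(p, g) for p in preds) for g in gts]
    return float(np.mean(scores)) if scores else 0.0
```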