Abdul Dost: “Evaluation and Fusion of Vision-Language and Computer Vision Models for On-road Scenario Extraction in Autonomous Vehicles” (MT Intro Talk)
In automated vehicle research, validation is a major challenge. EU Regulation 2022/1426 for Level-4 vehicle approval highlights the need for evidence-based, scenario-driven (virtual) validation [1]. In particular, this requires a method to identify and extract challenging scenarios: situations that are unexpected, complex, or safety-critical and demand an immediate response from the automated driving system. At present, such scenarios are typically derived through an expert-driven process, identified manually from real-world data; while effective, this approach is time- and cost-intensive and limits throughput [2], [3].
To address this, Vision-Language Models (VLMs) offer a promising alternative: they are time- and cost-efficient and provide contextual understanding of scenarios directly from image frames in video data [4]. However, VLMs can lack spatial accuracy and may produce reasoning errors or “hallucinations,” especially in complex road environments [5], [6]. When used on their own, they often require significant guidance in challenging situations, as they may misinterpret the actions of the ego vehicle or other traffic actors [5], [6].
To overcome these limitations, this thesis proposes integrating VLMs with traditional Computer Vision (CV) techniques. CV models are strong at object detection, depth estimation, and metric localization, but they lack higher-order semantic reasoning [7]–[10]. VLMs, in contrast, can interpret scene context, object interactions, and terrain semantics [4], yet struggle with metric reasoning and spatial grounding [6]. Fusing the two therefore offers a promising path forward.
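To make the fusion idea concrete, the following is a minimal sketch (not the thesis's actual method) of how the two outputs could be combined: a CV stage supplies detections with metric distances, a VLM supplies a free-text scene description, and a fusion step grounds each detection in the description. All names (`Detection`, `fuse`, the distance threshold, the example caption) are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    """Hypothetical output of a CV detection + depth-estimation stage."""
    label: str
    bbox: tuple          # (x1, y1, x2, y2) in pixels
    distance_m: float    # metric distance from depth estimation

def fuse(vlm_description: str, detections: list) -> dict:
    """Ground CV detections in the VLM's scene description.

    A detection is 'grounded' if the VLM mentions its class label;
    grounded objects keep their metric distance, so the fused record
    combines semantic context (VLM) with spatial accuracy (CV).
    """
    text = vlm_description.lower()
    grounded = [d for d in detections if d.label.lower() in text]
    return {
        "description": vlm_description,
        "grounded_objects": [(d.label, d.distance_m) for d in grounded],
        # Illustrative criticality rule: any grounded object closer
        # than 10 m flags the frame as a candidate scenario.
        "safety_critical": any(d.distance_m < 10.0 for d in grounded),
    }

# Hypothetical outputs of the two model families for one frame:
dets = [Detection("pedestrian", (310, 140, 360, 260), 8.5),
        Detection("car", (40, 150, 180, 240), 22.0)]
caption = "A pedestrian is crossing in front of the ego vehicle."

record = fuse(caption, dets)
print(record["safety_critical"])   # pedestrian at 8.5 m -> True
```

In this toy version, the VLM contributes the *what* (a pedestrian crossing) and the CV stage the *how far* (8.5 m), which neither model family delivers reliably on its own; a real fusion would of course use learned matching rather than label substring lookup.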