Geometry-Aware Key-Point / Object Detection and Pose-Estimation

A wide range of emerging applications creates an increasing demand for reliable and accurate object detection and pose estimation based on machine learning. This is particularly true for autonomous systems such as autonomous vehicles and robots, but also for augmented reality [1]. These applications require detecting and locating objects in real time and in varied environments, including cluttered scenes and scenes containing objects of similar appearance.

However, traditional object detection and pose estimation methods often detect and locate objects only partially in these challenging situations, leading to inaccurate and unreliable results [2]. Geometry-Aware Key-Point / Object Detection and Pose-Estimation addresses this by explicitly incorporating additional geometric information into these tasks to improve their accuracy and robustness. In object detection, the goal is to identify the presence and location of objects within an image or video. Pose estimation, in contrast, refers to estimating the position and orientation of objects in 3D space from 2D images. By encoding human domain knowledge in the form of geometric constraints, we aim to use the knowledge of domain experts to create more robust and accurate solutions while simultaneously reducing the labeling effort associated with training data-driven solutions for novel applications.
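To make the notion of a geometric constraint concrete, the following minimal sketch scores a pose hypothesis by how well the known 3D key-points of an object, projected into the image, agree with detected 2D key-points. It is only an illustration under stated assumptions and not the method proposed here: the function names, the pinhole camera parameters (f, cx, cy), the restriction to a rotation about the vertical axis, and the pallet-like key-point layout are all hypothetical.

```python
import math

def project(point, f, cx, cy):
    """Pinhole projection of a 3D point (X, Y, Z) in the camera frame to pixels."""
    X, Y, Z = point
    return (f * X / Z + cx, f * Y / Z + cy)

def pose_transform(point, yaw, t):
    """Rotate a model point about the vertical (Y) axis by yaw, then translate by t."""
    x, y, z = point
    c, s = math.cos(yaw), math.sin(yaw)
    return (c * x + s * z + t[0], y + t[1], -s * x + c * z + t[2])

def reprojection_error(model_points, observed_px, yaw, t, f, cx, cy):
    """Mean pixel distance between projected model key-points and detections."""
    total = 0.0
    for p, (u_obs, v_obs) in zip(model_points, observed_px):
        u, v = project(pose_transform(p, yaw, t), f, cx, cy)
        total += math.hypot(u - u_obs, v - v_obs)
    return total / len(model_points)

# Hypothetical key-points: four corners of a pallet-like front face, in metres.
MODEL_POINTS = [(-0.6, -0.07, 0.0), (0.6, -0.07, 0.0),
                (0.6, 0.07, 0.0), (-0.6, 0.07, 0.0)]

# Assumed camera intrinsics and a ground-truth pose used to simulate detections.
F, CX, CY = 800.0, 320.0, 240.0
TRUE_YAW, TRUE_T = 0.3, (0.1, 0.2, 3.0)
observed = [project(pose_transform(p, TRUE_YAW, TRUE_T), F, CX, CY)
            for p in MODEL_POINTS]

# A pose hypothesis that respects the object geometry yields a low error;
# a wrong orientation yields a clearly larger one.
good = reprojection_error(MODEL_POINTS, observed, TRUE_YAW, TRUE_T, F, CX, CY)
bad = reprojection_error(MODEL_POINTS, observed, 0.0, TRUE_T, F, CX, CY)
print(good, bad)
```

A term of this kind could serve as an additional loss or as a consistency check on predicted key-points, since it depends on the known object geometry and not on further labels. Whether and how such a term is used is left to the literature review and the design of the approach.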

There are various approaches to incorporating geometric information into object detection and pose estimation tasks. One common approach is to use geometry-aware convolutional neural networks (Geo-CNNs) [4], which explicitly incorporate geometric information into the model architecture. Another is geometry-aware scene graph generation [5], which uses a graph-based representation to model the geometric relationships between objects in a scene. Which approach is suitable, however, depends on the task at hand, the variability of object shape and orientation, and the scene complexity, as well as on how well it allows expert knowledge to be used and labeling effort to be reduced. An assessment of existing methods against these requirements is part of the literature review for the proposed work. Afterwards, the central task of the work is either the adaptation of an existing method or the design and implementation of a novel approach, together with the corresponding evaluation.

Evaluation will be performed on an industrial object detection use case with high requirements on robustness and performance: the detection of pallets in the context of an autonomous pallet unloading application. The plan is to start from a public data set [7] and afterwards transfer the results to our use case and data, which have already been at least partially collected. The thesis shall be carried out within a period of six months, including the literature review.

[1] Realtime 3D Object Detection for Automated Driving Using Stereo Vision and Semantic Information
[2] Viewpoint-Independent Object Class Detection using 3D Feature Maps
[3] Unsupervised 3D Pose Estimation With Geometric Self-Supervision
[4] Modeling Local Geometric Structure of 3D Point Clouds using Geo-CNN
[5] A Comprehensive Survey of Scene Graphs: Generation and Application
[6] From Points to Parts
[7] LOCO: the first scene understanding dataset for logistics (GitHub repository tum-fml/loco)
[8] Nothing But Geometric Constraints
[9] DeepIM: Deep Iterative Matching for 6D Pose Estimation