A Unified Framework for Depth-Assisted Monocular Object Pose Estimation
Published in: | IEEE Access, Vol. 12, pp. 111723-111740 |
---|---|
Main Authors: | |
Format: | Journal Article |
Language: | English |
Published: | IEEE, 2024 |
Summary: | Monocular Depth Estimation (MDE) and Object Pose Estimation (OPE) are important tasks in visual scene understanding. Traditionally, these challenges have been addressed independently, with separate deep neural networks designed for each task. However, we contend that MDE, which provides information about object distances from the camera, and OPE, which focuses on determining precise object position and orientation, are inherently connected. Combining these tasks in a unified approach facilitates the integration of spatial context, offering a more comprehensive understanding of object distribution in three-dimensional space. Consequently, this work addresses both challenges simultaneously, treating them as a multi-task learning problem. Our proposed solution is encapsulated in a Unified Framework for Depth-Assisted Monocular Object Pose Estimation. Taking Red-Green-Blue (RGB) images as input, our framework estimates the poses of multiple object instances alongside an instance-level depth map. During training, we use both depth and color images, but during inference the model relies exclusively on color images. To enhance the depth-aware features crucial for robust object pose estimation, we introduce a depth estimation branch supervised by depth images. These features are further refined by a cross-task attention module, which significantly improves feature discriminability and robustness in object pose estimation. Through extensive experiments, our approach demonstrates competitive performance compared to state-of-the-art methods in object pose estimation. Moreover, our method operates in real time, underscoring its efficiency and practical applicability in various scenarios. This unified framework not only advances the state of the art in monocular depth estimation and object pose estimation but also highlights the potential of multi-task learning for enhancing the understanding of complex visual scenes. |
---|---|
ISSN: | 2169-3536 |
DOI: | 10.1109/ACCESS.2024.3443148 |
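
The summary describes the framework only at a high level: a shared RGB encoder, a depth branch supervised by depth images during training, a cross-task attention module that refines pose features with depth-aware features, and RGB-only inference. The sketch below, in PyTorch, shows one way such a design could be wired together; the module names, layer sizes, token-based attention layout, and the 7-dimensional translation-plus-quaternion pose parametrization are all illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class CrossTaskAttention(nn.Module):
    """Hypothetical cross-task attention: pose tokens attend to depth tokens."""
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, pose_feat, depth_feat):
        # pose_feat, depth_feat: (B, N, C) token sequences from the shared encoder
        refined, _ = self.attn(query=pose_feat, key=depth_feat, value=depth_feat)
        return self.norm(pose_feat + refined)  # residual connection + normalization

class DepthAssistedPoseNet(nn.Module):
    def __init__(self, dim=128, num_objects=8):
        super().__init__()
        # Shared RGB encoder (a toy stand-in for the paper's backbone).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, dim, 7, stride=4, padding=3), nn.ReLU(),
            nn.Conv2d(dim, dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Depth branch: supervised by depth images during training only.
        self.depth_head = nn.Conv2d(dim, 1, 1)
        self.pose_proj = nn.Conv2d(dim, dim, 1)
        self.depth_proj = nn.Conv2d(dim, dim, 1)
        self.cross_attn = CrossTaskAttention(dim)
        # Per-object pose head: 3 for translation + 4 for a unit quaternion.
        self.pose_head = nn.Linear(dim, num_objects * 7)

    def forward(self, rgb):
        f = self.encoder(rgb)                               # (B, C, H', W')
        depth = self.depth_head(f)                          # coarse depth map
        B, C, H, W = f.shape
        pf = self.pose_proj(f).flatten(2).transpose(1, 2)   # (B, N, C) pose tokens
        df = self.depth_proj(f).flatten(2).transpose(1, 2)  # (B, N, C) depth tokens
        refined = self.cross_attn(pf, df)                   # depth-aware pose features
        pooled = refined.mean(dim=1)                        # (B, C)
        poses = self.pose_head(pooled)                      # (B, num_objects * 7)
        return poses.view(B, -1, 7), depth

model = DepthAssistedPoseNet()
rgb = torch.randn(2, 3, 256, 256)
poses, depth = model(rgb)           # poses: (2, 8, 7), depth: (2, 1, 32, 32)
# Training would combine a pose loss on `poses` with a depth loss comparing
# `depth` against ground-truth depth images; inference uses `poses` alone.
```

The design point the abstract emphasizes is visible here: the depth head and its supervision exist only to shape the shared features during training, so at inference the model consumes nothing but the RGB image.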