A Unified Framework for Depth-Assisted Monocular Object Pose Estimation

Bibliographic Details
Published in: IEEE Access, Vol. 12, pp. 111723-111740
Main Authors: Hoang, Dinh-Cuong, Xuan Tan, Phan, Nguyen, Thu-Uyen, Pham, Hai-Nam, Nguyen, Chi-Minh, Bui, Son-Anh, Duong, Quang-Tri, Vu, Van-Duc, Nguyen, Van-Thiep, Duong, Van-Hiep, Hoang, Ngoc-Anh, Phan, Khanh-Toan, Tran, Duc-Thanh, Ho, Ngoc-Trung, Tran, Cong-Trinh
Format: Journal Article
Language: English
Published: IEEE, 2024
Description
Summary: Monocular Depth Estimation (MDE) and Object Pose Estimation (OPE) are important tasks in visual scene understanding. Traditionally, these challenges have been addressed independently, with separate deep neural networks designed for each task. However, we contend that MDE, which provides information about object distances from the camera, and OPE, which determines precise object position and orientation, are inherently connected. Combining these tasks in a unified approach integrates spatial context, offering a more comprehensive understanding of object distribution in three-dimensional space. Consequently, this work addresses both challenges simultaneously, treating them as a multi-task learning problem. Our proposed solution is a Unified Framework for Depth-Assisted Monocular Object Pose Estimation. Taking Red-Green-Blue (RGB) images as input, our framework estimates the poses of multiple object instances alongside an instance-level depth map. During training, we use both depth and color images; during inference, the model relies exclusively on color images. To enhance the depth-aware features crucial for robust object pose estimation, we introduce a depth estimation branch supervised by depth images. These features are further refined by a cross-task attention module, which significantly improves feature discriminability and robustness in object pose estimation. Through extensive experiments, our approach demonstrates competitive performance compared with state-of-the-art methods in object pose estimation. Moreover, our method operates in real time, underscoring its efficiency and practical applicability in a variety of scenarios. This unified framework not only advances the state of the art in monocular depth estimation and object pose estimation but also underscores the potential of multi-task learning for understanding complex visual scenes.
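
To make the described pipeline concrete, here is a minimal PyTorch sketch of the general idea: a shared RGB encoder, a depth branch whose prediction is supervised by depth images at training time, and a cross-task attention step that injects depth-aware features into the pose branch before pose prediction. The abstract does not specify the backbone, the attention design, the pose parameterization, or the loss weighting, so every module and name below (DepthAssistedPoseNet, CrossTaskAttention, the 7-channel quaternion-plus-translation head) is a hypothetical illustration of the technique, not the authors' implementation.

```python
import torch
import torch.nn as nn

class CrossTaskAttention(nn.Module):
    """Hypothetical cross-task attention: pose tokens attend to depth tokens.

    The paper's actual module is not described in this record; this is a
    generic multi-head cross-attention with a residual connection.
    """
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, pose_feat: torch.Tensor, depth_feat: torch.Tensor) -> torch.Tensor:
        # pose_feat, depth_feat: (B, N, C) flattened spatial tokens
        fused, _ = self.attn(query=pose_feat, key=depth_feat, value=depth_feat)
        return self.norm(pose_feat + fused)

class DepthAssistedPoseNet(nn.Module):
    """Hypothetical multi-task network: RGB in, depth map and dense pose out."""
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        # Shared RGB encoder (placeholder conv stack; a real system would
        # likely use a pretrained backbone).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, feat_dim, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        # Depth branch: produces depth-aware features and a depth prediction
        # that is supervised by real depth images during training only.
        self.depth_branch = nn.Sequential(
            nn.Conv2d(feat_dim, feat_dim, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.depth_head = nn.Conv2d(feat_dim, 1, 1)
        self.cross_attn = CrossTaskAttention(feat_dim)
        # Pose head: per-pixel quaternion (4) + translation (3), simplified.
        self.pose_head = nn.Conv2d(feat_dim, 7, 1)

    def forward(self, rgb: torch.Tensor):
        f = self.encoder(rgb)                         # (B, C, H', W')
        d_feat = self.depth_branch(f)
        depth = self.depth_head(d_feat)               # (B, 1, H', W')
        b, c, h, w = f.shape
        pose_tok = f.flatten(2).transpose(1, 2)       # (B, H'*W', C)
        depth_tok = d_feat.flatten(2).transpose(1, 2)
        fused = self.cross_attn(pose_tok, depth_tok)  # depth-aware refinement
        fused = fused.transpose(1, 2).reshape(b, c, h, w)
        pose = self.pose_head(fused)                  # (B, 7, H', W')
        return depth, pose

if __name__ == "__main__":
    model = DepthAssistedPoseNet()
    rgb = torch.randn(2, 3, 64, 64)
    depth_pred, pose_pred = model(rgb)                # RGB-only inference
    print(depth_pred.shape, pose_pred.shape)          # (2,1,16,16) (2,7,16,16)
```

In a training loop under these assumptions, the depth prediction would be penalized against the recorded depth image (for instance with an L1 term) and summed with the pose loss, e.g. loss = pose_loss + lam * l1(depth_pred, depth_gt). At inference the depth branch still runs but requires no depth input, which matches the RGB-only inference described in the abstract.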
ISSN: 2169-3536
DOI: 10.1109/ACCESS.2024.3443148