Multi-Modal Hand-Object Pose Estimation With Adaptive Fusion and Interaction Learning


Bibliographic Details
Published in: IEEE Access, Vol. 12, pp. 54339-54351
Main Authors: Hoang, Dinh-Cuong, Tan, Phan Xuan, Nguyen, Anh-Nhat, Vu, Duy-Quang, Vu, Van-Duc, Nguyen, Thu-Uyen, Hoang, Ngoc-Anh, Phan, Khanh-Toan, Tran, Duc-Thanh, Nguyen, Van-Thiep, Duong, Quang-Tri, Ho, Ngoc-Trung, Tran, Cong-Trinh, Duong, Van-Hiep, Ngo, Phuc-Quan
Format: Journal Article
Language:English
Published: Piscataway: IEEE (The Institute of Electrical and Electronics Engineers, Inc.), 2024
Description
Summary: Hand-object configuration recovery is an important task in computer vision. The estimation of pose and shape for both hands and objects during interactive scenarios has various applications, particularly in augmented reality, virtual reality, and imitation-based robot learning. The problem is particularly challenging when the hand is interacting with objects in the environment, as this setting features both extreme occlusions and non-trivial shape deformations. While existing works treat the problem of estimating hand configurations (that is, pose and shape parameters) in isolation from the recovery of parameters of the object acted upon, we stipulate that the two problems are related and can be solved more accurately together. We introduce an approach that jointly learns the features of hand and object from color and depth (RGB-D) images. Our approach fuses appearance and geometric features in an adaptive manner, which allows us to accent or suppress the features that are more meaningful for the downstream task of hand-object configuration recovery. We combine a deep Hough voting strategy that builds on our adaptive features with a graph convolutional network (GCN) to learn the relationships between hand and held-object shapes during interaction. Experimental results demonstrate that our proposed approach consistently outperforms state-of-the-art methods on popular datasets.
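The "adaptive" fusion the abstract describes can be illustrated with a per-channel gating scheme that blends appearance (RGB) and geometric (depth) features, letting a learned gate accent one modality and suppress the other. This is a minimal NumPy sketch under assumed shapes and parameterization (`adaptive_fuse`, `w_gate`, `b_gate` are hypothetical names, not the paper's actual architecture):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def adaptive_fuse(appearance, geometric, w_gate, b_gate):
    """Blend appearance and geometric features with a learned
    per-channel gate in (0, 1): gate * appearance + (1 - gate) * geometric.
    The gate is conditioned on both modalities (illustrative design)."""
    both = np.concatenate([appearance, geometric], axis=-1)
    gate = sigmoid(both @ w_gate + b_gate)
    return gate * appearance + (1.0 - gate) * geometric

rng = np.random.default_rng(0)
C = 8                                 # feature channels per modality
app = rng.normal(size=(5, C))         # appearance features for 5 points
geo = rng.normal(size=(5, C))         # geometric features from depth
w_gate = rng.normal(size=(2 * C, C)) * 0.1
b_gate = np.zeros(C)
fused = adaptive_fuse(app, geo, w_gate, b_gate)
print(fused.shape)  # (5, 8)
```

Because the gate lies strictly between 0 and 1, each fused channel is a convex combination of the two modality features, so the fusion can emphasize either stream without discarding the other.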
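The GCN component mentioned in the abstract passes messages over a graph whose nodes represent hand and object parts. As a sketch only, one standard graph-convolution layer (symmetric normalization with self-loops, as in common GCN formulations; the paper's exact layer and graph construction may differ) applied to a toy hand-object graph:

```python
import numpy as np

def gcn_layer(A, H, W):
    """One graph-convolution layer: D^{-1/2} (A + I) D^{-1/2} H W,
    followed by ReLU. A is the adjacency matrix, H node features,
    W a learned weight matrix."""
    A_hat = A + np.eye(A.shape[0])          # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    D = np.diag(d_inv_sqrt)
    return np.maximum(D @ A_hat @ D @ H @ W, 0.0)

# Toy graph: 3 hand-joint nodes and 2 object-keypoint nodes, with
# edges linking interacting hand/object nodes (illustrative only).
A = np.array([[0, 1, 0, 1, 0],
              [1, 0, 1, 0, 1],
              [0, 1, 0, 0, 1],
              [1, 0, 0, 0, 1],
              [0, 1, 1, 1, 0]], dtype=float)
rng = np.random.default_rng(1)
H = rng.normal(size=(5, 4))                 # per-node input features
W = rng.normal(size=(4, 4)) * 0.5           # learned projection
H_out = gcn_layer(A, H, W)
print(H_out.shape)  # (5, 4)
```

Stacking such layers lets features computed on object nodes influence the hand-node estimates (and vice versa), which is the mechanism by which joint hand-object reasoning can outperform estimating each in isolation.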
ISSN: 2169-3536
DOI:10.1109/ACCESS.2024.3388870