Diffusion-Based Hypotheses Generation and Joint-Level Hypotheses Aggregation for 3D Human Pose Estimation


Bibliographic Details
Published in: IEEE Transactions on Circuits and Systems for Video Technology, p. 1
Main Authors: Shan, Wenkang; Zhang, Yuhuai; Zhang, Xinfeng; Wang, Shanshe; Zhou, Xilong; Ma, Siwei; Gao, Wen
Format: Journal Article
Language: English
Published: IEEE 14-06-2024
Description
Summary: To combine the advantages of deterministic and probabilistic 3D human pose estimation methods, we decompose pose estimation into two processes: hypotheses generation and hypotheses aggregation. For hypotheses generation, we propose a novel Diffusion-based 3D Pose generation (D3DP) method. D3DP generates a diverse set of plausible 3D pose hypotheses from a single 2D keypoint observation. Using a diffusion process, it gradually corrupts ground-truth 3D poses toward a random distribution, then employs a denoiser conditioned on the observed 2D keypoints to recover the uncorrupted 3D poses. Moreover, D3DP is compatible with existing deterministic 3D pose estimators and lets users tune the trade-off between computational efficiency and pose accuracy via two adjustable parameters. For hypotheses aggregation, we propose two alternative approaches: a Reprojection-Based Selection (RBS) method and a Hypotheses Selection Network (HSN). Both adopt a joint-level strategy to assemble the multiple hypotheses generated by D3DP into a single 3D pose for practical use. Specifically, RBS reprojects the 3D pose hypotheses onto the 2D camera plane and, for each joint, selects the best hypothesis based on the reprojection error. HSN scores each hypothesis and, for each joint, selects the one with the highest confidence score. The selected joints are then combined into the final pose. This joint-by-joint aggregation strategy exploits the 2D prior and temporal information, both of which have been ignored by previous pose-level methods. Extensive experiments on two benchmarks show that the proposed method outperforms state-of-the-art deterministic and probabilistic approaches.
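Illustration: the joint-level, reprojection-based selection described in the summary can be sketched roughly as follows. This is a minimal sketch and not the authors' implementation. It assumes a simple pinhole camera with known intrinsics and hypotheses expressed in camera coordinates; the names `project`, `joint_level_select`, `focal`, and `center` are hypothetical and chosen only for this example.

```python
from typing import List, Sequence, Tuple

Point3D = Tuple[float, float, float]
Point2D = Tuple[float, float]


def project(point: Point3D, focal: float, center: Point2D) -> Point2D:
    """Pinhole projection of a camera-space 3D point (assumed camera model)."""
    x, y, z = point
    return (focal * x / z + center[0], focal * y / z + center[1])


def joint_level_select(
    hypotheses: Sequence[Sequence[Point3D]],
    keypoints_2d: Sequence[Point2D],
    focal: float = 1.0,
    center: Point2D = (0.0, 0.0),
) -> List[Point3D]:
    """For each joint, keep the hypothesis whose reprojection is closest
    to the observed 2D keypoint, then assemble the joints into one pose."""
    final_pose: List[Point3D] = []
    for j, (u, v) in enumerate(keypoints_2d):

        def reprojection_error(h: int) -> float:
            # Squared 2D distance between reprojected joint and observation.
            pu, pv = project(hypotheses[h][j], focal, center)
            return (pu - u) ** 2 + (pv - v) ** 2

        best = min(range(len(hypotheses)), key=reprojection_error)
        final_pose.append(hypotheses[best][j])
    return final_pose


# Two hypotheses, two joints: each hypothesis is better for a different joint.
hypotheses = [
    [(0.0, 0.0, 2.0), (1.0, 1.0, 2.0)],
    [(0.2, 0.0, 2.0), (0.8, 1.0, 2.0)],
]
keypoints_2d = [(0.0, 0.0), (0.4, 0.5)]
pose = joint_level_select(hypotheses, keypoints_2d)
```

In this toy input the first joint is taken from the first hypothesis and the second joint from the second, which is the point of selecting per joint rather than per pose: the assembled pose can be closer to the 2D observation than any single hypothesis.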
ISSN: 1051-8215, 1558-2205
DOI: 10.1109/TCSVT.2024.3415348