Deep learning-based for human segmentation and tracking, 3D human pose estimation and action recognition on monocular video of MADS dataset
Published in: | Multimedia tools and applications Vol. 82; no. 14; pp. 20771 - 20818 |
---|---|
Format: | Journal Article |
Language: | English |
Published: | New York, Springer US, 01-06-2023, Springer Nature B.V |
Summary: | Human segmentation and tracking (HS-T) in video often relies on person detection results. In addition, 3D human pose estimation (3D-HPE) and human activity recognition (HAR) often use human segmentation results to reduce data storage and computation time. Recent advances in deep learning, especially Convolutional Neural Networks (CNNs), have produced excellent results on these related tasks. Consequently, they can be applied to build many practical applications such as sports analysis, sports scoring, health protection, teaching, and preserving traditional martial arts. In this paper, we survey relevant studies, methods, datasets, and results for HS-T, 3D-HPE, and HAR. We also analyze person detection results in depth, since they affect the results of human segmentation and human tracking. The survey is detailed down to the source code paths. The MADS (Martial Arts, Dancing, and Sports) dataset comprises fast and complex activities and was published for the task of human pose estimation. However, before the human pose can be determined, the person needs to be detected and segmented in the video, especially because the 3D human pose annotation data differ from the point cloud data generated from RGB-D images. Therefore, we have also prepared 2D human pose annotation data on the 28k images for creating 3D human pose annotation and action labeling data. Moreover, we evaluated recently published deep learning methods on the MADS dataset for human segmentation (Mask R-CNN, PointRend, TridentNet, TensorMask, and CenterMask) and tracking, 3D-HPE (RepNet, MediaPipe Pose, Lifting from the Deep, and V2V-PoseNet), and HAR (ST-GCN, DD-net, and PA-GesGCN) in video. All data and published results are available. |
---|---|
ISSN: | 1380-7501 1573-7721 |
DOI: | 10.1007/s11042-022-13921-w |