A human activity recognition method based on Vision Transformer

Bibliographic Details
Published in:	Scientific reports Vol. 14; no. 1; pp. 15310 - 18
Main Authors:	Han, Huiyan; Zeng, Hongwei; Kuang, Liqun; Han, Xie; Xue, Hongxin
Format:	Journal Article
Language:	English
Published:	London; Nature Publishing Group UK; 03-07-2024; Nature Publishing Group; Nature Portfolio
Subjects:	639/705/1041; 639/705/117; Accuracy; Algorithms; Classification; Computer applications; Computer vision; Datasets; Deep learning; Euclidean space; Human Activities; Human activity recognition; Humanities and Social Sciences; Humans; Image Processing, Computer-Assisted - methods; Imaging, Three-Dimensional - methods; Methods; multidisciplinary; Neural Networks, Computer; Pattern Recognition, Automated - methods; Science; Science (multidisciplinary); Semantics; Skeleton data; Spatio-temporal; Virtual reality; ViT

Description
Summary:	Human activity recognition has a wide range of applications in fields such as video surveillance, virtual reality and human–computer intelligent interaction, and it has emerged as a significant research area in computer vision. Graph convolutional networks (GCNs) have recently been widely used in these fields and have achieved strong performance. However, some challenges remain, including the over-smoothing problem caused by stacked graph convolutions and semantic correlation that is too weak to capture large movements between time sequences. Vision Transformer (ViT) is used in many 2D and 3D image fields and has produced surprisingly good results. In our work, we propose a novel human activity recognition method based on ViT (HAR-ViT). We integrate an enhanced version (eAGCL) of the AGCL in 2s-AGCN into ViT so that it can process spatio-temporal data (3D skeletons) and make full use of spatial features. The position encoder module orders the non-sequenced information, while the transformer encoder efficiently compresses sequence data features to increase computation speed. Human activity recognition is then performed by a multi-layer perceptron (MLP) classifier. Experimental results demonstrate that the proposed method achieves state-of-the-art (SOTA) performance on three widely used datasets: NTU RGB+D 60, NTU RGB+D 120 and Kinetics-Skeleton 400.
ISSN:	2045-2322
DOI:	10.1038/s41598-024-65850-3
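
The over-smoothing problem named in the summary can be illustrated with a small toy sketch. It is not the paper's eAGCL or any part of HAR-ViT: the 5-joint chain, the feature values, and the helper names (`normalized_adjacency`, `propagate`, `spread`) are all assumptions made for illustration. Each step is a parameter-free graph convolution in which every joint averages itself with its neighbours. Stacking many such steps drives the joint features toward a single shared value, which is the loss of discriminative information that "over-smoothing" refers to.

```python
# Toy illustration of over-smoothing from stacked graph convolutions.
# Assumptions (not from the paper): a 5-joint chain "skeleton", one scalar
# feature per joint, and a parameter-free averaging step with no learned weights.

def normalized_adjacency(edges, num_nodes):
    """Return the row-normalised adjacency with self-loops, D^-1 (A + I)."""
    adj = [[1.0 if i == j else 0.0 for j in range(num_nodes)]
           for i in range(num_nodes)]
    for i, j in edges:
        adj[i][j] = 1.0
        adj[j][i] = 1.0
    return [[value / sum(row) for value in row] for row in adj]

def propagate(adj, features):
    """One graph-convolution step: each joint averages its neighbourhood."""
    return [sum(a * x for a, x in zip(row, features)) for row in adj]

def spread(features):
    """Gap between the largest and smallest joint feature."""
    return max(features) - min(features)

# Hypothetical chain of joints 0-1-2-3-4 with clearly distinct end features.
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
adj = normalized_adjacency(edges, 5)
features = [1.0, 0.0, 0.0, 0.0, -1.0]

spreads = [spread(features)]
for _ in range(50):  # 50 stacked layers
    features = propagate(adj, features)
    spreads.append(spread(features))

# The spread shrinks as layers are stacked: joints become indistinguishable.
print([round(spreads[k], 6) for k in (0, 1, 10, 50)])
```

Real GCN layers also apply learned weights and non-linearities, so this sketch only shows the averaging effect of the graph structure, not the behaviour of a trained model.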