Gaze Estimation Based on Convolutional Structure and Sliding Window-Based Attention Mechanism

Bibliographic Details
Published in:Sensors (Basel, Switzerland) Vol. 23; no. 13; p. 6226
Main Authors: Li, Yujie, Chen, Jiahui, Ma, Jiaxin, Wang, Xiwen, Zhang, Wei
Format: Journal Article
Language:English
Published: Switzerland MDPI AG 07-07-2023
Description
Summary:The direction of human gaze is an important indicator of human behavior, reflecting the level of attention and cognitive state toward visual stimuli in the environment. Convolutional neural networks have achieved good performance in gaze estimation tasks, but their limited global modeling capability makes it difficult to improve prediction performance further. In recent years, transformer models have been introduced for gaze estimation and have achieved state-of-the-art performance. However, their slicing-and-mapping mechanism for processing local image patches can compromise local spatial information. Moreover, the single down-sampling rate and fixed-size tokens are not suitable for the multiscale feature learning required in gaze estimation tasks. To overcome these limitations, this study introduces the Swin Transformer for gaze estimation and designs two network architectures: a pure Swin Transformer gaze estimation model (SwinT-GE) and a hybrid model that combines convolutional structures with SwinT-GE (Res-Swin-GE). SwinT-GE uses the tiny version of the Swin Transformer for gaze estimation. Res-Swin-GE replaces the slicing-and-mapping mechanism of SwinT-GE with convolutional structures. Experimental results demonstrate that Res-Swin-GE significantly outperforms SwinT-GE, exhibiting strong competitiveness on the MPIIFaceGaze dataset and achieving a 7.5% performance improvement over existing state-of-the-art methods on the EYEDIAP dataset.
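To make the architectural contrast concrete, the sketch below compares the two tokenization strategies the abstract refers to: the standard slicing-and-mapping patch embedding (non-overlapping patches linearly projected to tokens, as in ViT/Swin) versus a convolutional stem in the spirit of Res-Swin-GE, where overlapping strided convolutions preserve local spatial context before tokenization. This is a minimal PyTorch illustration, not the authors' implementation; the layer sizes and the exact stem design are assumptions.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Slicing-and-mapping: non-overlapping 4x4 patches are linearly
    projected to tokens (equivalently, a conv with kernel = stride = 4)."""
    def __init__(self, in_ch=3, embed_dim=96, patch=4):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, embed_dim, kernel_size=patch, stride=patch)

    def forward(self, x):
        x = self.proj(x)                     # (B, C, H/4, W/4)
        return x.flatten(2).transpose(1, 2)  # (B, N, C) token sequence

class ConvStem(nn.Module):
    """Hypothetical convolutional replacement: two overlapping 3x3
    strided convs reach the same 4x downsampling while mixing
    information across patch boundaries, keeping local structure."""
    def __init__(self, in_ch=3, embed_dim=96):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(in_ch, embed_dim // 2, 3, stride=2, padding=1),
            nn.BatchNorm2d(embed_dim // 2),
            nn.ReLU(inplace=True),
            nn.Conv2d(embed_dim // 2, embed_dim, 3, stride=2, padding=1),
        )

    def forward(self, x):
        x = self.stem(x)                     # (B, C, H/4, W/4)
        return x.flatten(2).transpose(1, 2)  # same token layout as PatchEmbed

if __name__ == "__main__":
    x = torch.randn(2, 3, 224, 224)          # face crops, ImageNet-style size
    print(PatchEmbed()(x).shape)             # torch.Size([2, 3136, 96])
    print(ConvStem()(x).shape)               # torch.Size([2, 3136, 96])
```

Because both stems emit the same (B, N, C) token layout, the convolutional stem can be swapped in front of unmodified Swin Transformer blocks, which is the design idea the abstract attributes to Res-Swin-GE.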
ISSN:1424-8220
DOI:10.3390/s23136226