Analysis of Transformer's Attention Behavior in Sleep Stage Classification and Limiting It to Improve Performance
Published in: IEEE Access, Vol. 12, pp. 95914-95925
Main Authors: , , ,
Format: Journal Article
Language: English
Published: Piscataway: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 2024
Summary: The transformer architecture has been applied to many tasks, including natural language processing and vision. The most common prerequisite for using a transformer-based architecture is that the model be pretrained on a large-scale dataset before it is fine-tuned for a specific task such as classification or object detection. In this paper, however, we find that the transformer architecture generalizes better than CNN-based architectures when capturing features from data samples for sleep stage classification, despite being trained on a small-scale dataset without large-scale pretraining. This outcome contradicts the widely held belief that a transformer architecture is effective only when trained on large datasets. We investigate the attention behavior of a transformer model and demonstrate how global and local attention influence the attention map in a transformer architecture. Finally, through experiments on three different datasets, we show that restricting global attention using Masked Multi-Head Self-Attention (M-MHSA) improves model generalization in sleep stage classification compared with previous methodologies and the original transformer-based architecture.
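The summary describes restricting global attention with a masked multi-head self-attention so that each position attends only to nearby positions. The paper's exact masking scheme is not reproduced here; the following is a minimal single-head sketch, assuming a fixed-width banded (diagonal) mask in which a hypothetical `window` parameter bounds how far apart two sequence positions may be and still attend to each other.

```python
import torch
import torch.nn.functional as F

# Illustrative stand-in for local-attention masking, NOT the
# paper's M-MHSA implementation (its mask shape is an assumption).

def banded_mask(seq_len: int, window: int) -> torch.Tensor:
    # True where attention is BLOCKED: positions more than
    # `window` steps apart cannot attend to each other.
    idx = torch.arange(seq_len)
    return (idx[:, None] - idx[None, :]).abs() > window

def masked_self_attention(x: torch.Tensor, w_qkv: torch.Tensor,
                          window: int) -> torch.Tensor:
    # x: (batch, seq_len, d_model); w_qkv: (d_model, 3 * d_model).
    # Single-head scaled dot-product attention with a local mask.
    q, k, v = (x @ w_qkv).chunk(3, dim=-1)
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5  # (b, n, n)
    scores = scores.masked_fill(banded_mask(x.shape[1], window),
                                float("-inf"))
    return F.softmax(scores, dim=-1) @ v

# Usage: a sequence of 30 epochs with 64-dim features, each epoch
# attending only to its +/- 3 neighbors.
x = torch.randn(2, 30, 64)
w_qkv = torch.randn(64, 3 * 64)
out = masked_self_attention(x, w_qkv, window=3)  # (2, 30, 64)
```

Masking the score matrix with `-inf` before the softmax zeroes out the blocked positions' attention weights, which is the standard way to restrict global attention without changing the rest of the transformer block.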
ISSN: 2169-3536
DOI: 10.1109/ACCESS.2024.3424236