LGSNet: A Two-Stream Network for Micro- and Macro-Expression Spotting With Background Modeling


Bibliographic Details
Published in: IEEE Transactions on Affective Computing, Vol. 15, No. 1, pp. 1-18
Main Authors: Yu, Wang-Wang, Jiang, Jingwen, Yang, Kai-Fu, Yan, Hong-Mei, Li, Yong-Jie
Format: Journal Article
Language: English
Published: Piscataway, NJ: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01-01-2024
Description
Summary: Micro- and macro-expression spotting in an untrimmed video is a challenging task due to the large number of false positive samples it generates. Most existing methods localize high-response regions by extracting hand-crafted features or by cropping specific regions from all, or some key, raw images. However, these methods either neglect continuous temporal information or model inherent human motion patterns (background) as foreground. We therefore propose a novel two-stream network, the Local suppression and Global enhancement Spotting Network (LGSNet), which takes segment-level features from optical flow and video as input. LGSNet adopts anchors to encode expression intervals and selects the encoded deviations as the optimization target. Furthermore, we introduce a Temporal Multi-Receptive Field Feature Fusion Module (TMRF³M) and a Local Suppression and Global Enhancement Module (LSGEM), which help spot short intervals more precisely and suppress background information. To further highlight the differences between positive and negative samples, we place a large number of random pseudo ground truth intervals (background clips) on discarded sliding windows to model background clips and counteract the effect of non-expressive face and head movements. Experimental results show that the proposed network achieves state-of-the-art performance on the CAS(ME)², CAS(ME)³, and SAMM-LV datasets.
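The record includes no code, but as a rough sketch of the anchor mechanism the abstract describes (encoding expression intervals relative to anchors and regressing the deviations), the following assumes the center/length parameterization common in temporal action detection. The function names and exact parameterization are illustrative assumptions, not LGSNet's published implementation.

```python
import torch

def encode_deviations(anchors: torch.Tensor, intervals: torch.Tensor) -> torch.Tensor:
    """Encode ground-truth intervals as deviations from anchors.

    anchors, intervals: (N, 2) tensors of (center, length) in frame units.
    Parameterization assumed here (standard in temporal detection, not
    necessarily LGSNet's): normalized center offset and log length ratio.
    """
    d_center = (intervals[:, 0] - anchors[:, 0]) / anchors[:, 1]
    d_length = torch.log(intervals[:, 1] / anchors[:, 1])
    return torch.stack([d_center, d_length], dim=1)

def decode_deviations(anchors: torch.Tensor, deviations: torch.Tensor) -> torch.Tensor:
    """Invert the encoding: recover (center, length) intervals from predicted deviations."""
    center = anchors[:, 0] + deviations[:, 0] * anchors[:, 1]
    length = anchors[:, 1] * torch.exp(deviations[:, 1])
    return torch.stack([center, length], dim=1)

# Example: two anchors and two ground-truth expression intervals.
anchors = torch.tensor([[16.0, 8.0], [48.0, 32.0]])   # (center, length)
targets = torch.tensor([[18.0, 10.0], [45.0, 24.0]])
dev = encode_deviations(anchors, targets)             # regression targets for training
recovered = decode_deviations(anchors, dev)           # round-trips back to `targets`
```

Under this parameterization, the network only has to predict small corrections to fixed anchors rather than absolute interval boundaries, which is what makes the encoded deviations a natural optimization target.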
ISSN: 1949-3045
DOI: 10.1109/TAFFC.2023.3266808