LGSNet: A Two-Stream Network for Micro- and Macro-Expression Spotting With Background Modeling
Published in: | IEEE Transactions on Affective Computing, Vol. 15, no. 1, pp. 1-18 |
---|---|
Format: | Journal Article |
Language: | English |
Published: | Piscataway: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01-01-2024 |
Summary: | Micro- and macro-expression spotting in an untrimmed video is a challenging task, due to the mass generation of false positive samples. Most existing methods localize higher response areas by extracting hand-crafted features or cropping specific regions from all or some key raw images. However, these methods either neglect the continuous temporal information or model the inherent human motion paradigms (background) as foreground. Consequently, we propose a novel two-stream network, named Local suppression and Global enhancement Spotting Network (LGSNet), which takes segment-level features from optical flow and videos as input. LGSNet adopts anchors to encode expression intervals and selects the encoded deviations as the object of optimization. Furthermore, we introduce a Temporal Multi-Receptive Field Feature Fusion Module (TMRF³M) and a Local Suppression and Global Enhancement Module (LSGEM), which help spot short intervals more precisely and suppress background information. To further highlight the differences between positive and negative samples, we set up a large number of random pseudo ground truth intervals (background clips) on some discarded sliding windows to accomplish background clips modeling to counteract the effect of non-expressive face and head movements. Experimental results show that our proposed network achieves state-of-the-art performance on the CAS(ME)², CAS(ME)³ and SAMM-LV datasets. |
---|---|
ISSN: | 1949-3045 |
DOI: | 10.1109/TAFFC.2023.3266808 |