Frame as Video Clip: Proposal-Free Moment Retrieval by Semantic Aligned Frames

Bibliographic Details
Published in: IEEE Transactions on Industrial Informatics, Vol. 20, no. 11, pp. 13158-13168
Main Authors: Shi, Mingzhu; Su, Yuhao; Lin, Xinhui; Zao, Bin; Kong, Siqi; Tan, Muxian
Format: Journal Article
Language: English
Published: IEEE 01-11-2024
Description
Summary: Video moment retrieval, also known as temporal sentence grounding in videos, is a promising technique with various industrial applications, especially in security and surveillance systems, where it enables quick identification of specific moments in long videos using natural language. However, current approaches often segment videos into short clips and encode each clip with a video encoder, which imposes a significant computational burden because encoding requires a large number of frames. To tackle this challenge, we propose an efficient moment retrieval approach named frame as video clip, which integrates sparsely sampled video frames with pretrained vision-language models and employs a proposal-free strategy based on a vanilla transformer. It requires only the essential modalities (video and text) and minimal domain knowledge. The proposed approach reduces the length of the input frame sequence by more than 25 times, and potentially by up to 100 times in certain scenarios. Furthermore, it achieves competitive performance on the ActivityNet Captions and Charades-STA datasets.
ISSN: 1551-3203, 1941-0050
DOI: 10.1109/TII.2024.3431097