Frame as Video Clip: Proposal-Free Moment Retrieval by Semantic Aligned Frames
Published in: | IEEE transactions on industrial informatics Vol. 20; no. 11; pp. 13158 - 13168 |
---|---|
Main Authors: | , , , , , |
Format: | Journal Article |
Language: | English |
Published: | IEEE, 01-11-2024 |
Summary: | Video moment retrieval, also called temporal sentence grounding in videos, is a promising technique with various industrial applications, especially in security and surveillance systems, where it enables quick identification of specific moments in long videos using natural language. However, current approaches often segment videos into short clips and encode each clip with a video encoder, which imposes a significant computational burden because the encoding process requires a large number of frames. To tackle this challenge, we propose an efficient moment retrieval approach named frame as video clip, which integrates sparsely sampled video frames and pretrained vision-language models, employing a proposal-free strategy based on a vanilla transformer. It requires only the essential modalities (video and text) and minimal domain knowledge. The proposed approach reduces the length of the input frame sequence by a factor of more than 25, potentially reaching a factor of 100 in certain scenarios. Furthermore, it achieves competitive performance on the ActivityNet Captions and Charades-STA datasets. |
---|---|
ISSN: | 1551-3203 1941-0050 |
DOI: | 10.1109/TII.2024.3431097 |