MOVES: Motion-Oriented VidEo Sampling for Natural Language-Based Vehicle Retrieval

Bibliographic Details
Published in:	2024 IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), pp. 1-7
Main Authors:	Kim, Dongyoung; Lee, Kyoungoh; Jang, In-Su; Kim, Kwang-Ju; Kim, Pyong-Kun; Yoo, Jaejun
Format:	Conference Proceeding
Language:	English
Published:	IEEE 15-07-2024
Subjects:	Correlation; Data augmentation; Data models; Natural languages; Sampling methods; Surveillance; Visualization

Description
Summary:	Retrieving a target vehicle from natural language descriptions plays a crucial role in intelligent transportation systems. Existing methods tackle this task with models that leverage the correlation between textual and visual representations, such as CLIP. However, these models struggle to capture the temporal characteristics of video data, so researchers have sought to improve temporal understanding through various data augmentation techniques and video encoders. Yet these conventional approaches often overlook the detailed temporal characteristics of vehicles. To overcome this limitation, we introduce MOVES, a Motion-Oriented VidEo Sampling method that effectively utilizes the motion information of the target vehicle. Furthermore, we construct a robust model by implementing a re-ranking algorithm to address a variety of vehicle attributes. As a result, our proposed model achieves state-of-the-art performance on the public vehicle retrieval dataset.
ISSN:	2643-6213
DOI:	10.1109/AVSS61716.2024.10672583