Multi-Language Transformer For Improved Text to Remote Sensing Image Retrieval

Bibliographic Details
Published in:	IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, pp. 1-12
Main Authors:	Rahhal, Mohamad M. Al; Bazi, Yakoub; Alsharif, Norah A.; Bashmal, Laila; Alajlan, Naif; Melgani, Farid
Format:	Journal Article
Language:	English
Published:	IEEE 2022
Subjects:	Contrastive loss; cross-modal retrieval; Feature extraction; Image retrieval; language transformer; Optical filters; remote sensing; Semantics; Task analysis; Transformers; vision transformer; Visualization

Description
Summary:	Cross-modal text-image retrieval in remote sensing (RS) provides a flexible way to mine useful information from RS repositories. However, existing methods accept queries formulated in English only, which may restrict access to useful information for non-English speakers. Allowing multi-language queries can improve communication with the retrieval system and broaden access to RS information. To address this limitation, this paper proposes a multi-language framework based on transformers. Specifically, the framework is composed of two transformer encoders that learn modality-specific representations: a language encoder that generates language representation features from the textual description, and a vision encoder that extracts visual features from the corresponding image. The two encoders are trained jointly on image-text pairs by minimizing a bidirectional contrastive loss. To enable the model to understand queries in multiple languages, it was trained on descriptions in four languages: English, Arabic, French, and Italian. Experimental results on three benchmark datasets (RSITMD, RSICD, and UCM) demonstrate that the proposed model significantly improves retrieval performance in terms of recall compared with existing state-of-the-art RS retrieval methods.
ISSN:	1939-1404, 2151-1535
DOI:	10.1109/JSTARS.2022.3215803
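
Illustration: The summary states that the two encoders are trained jointly by minimizing a bidirectional contrastive loss, but this record does not give the formula. The sketch below shows one common form of such a loss (a symmetric softmax cross-entropy over cosine similarities, averaged over the image-to-text and text-to-image directions). The function names, the temperature value, and the choice of this particular formulation are assumptions made for illustration and may differ from what the paper uses. The embeddings stand in for encoder outputs, with row i of each list forming a matching image-text pair.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over one row of similarity scores."""
    peak = max(row)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_sum for x in row]


def bidirectional_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Hypothetical symmetric contrastive loss for a batch of paired embeddings.

    image_emb[i] and text_emb[i] are assumed to describe the same sample.
    The temperature default is an illustrative assumption, not a value
    taken from the paper.
    """
    n = len(image_emb)
    img = [l2_normalize(v) for v in image_emb]
    txt = [l2_normalize(v) for v in text_emb]

    # sim[i][j]: temperature-scaled cosine similarity of image i and text j.
    sim = [
        [sum(a * b for a, b in zip(img[i], txt[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    # Image-to-text direction: each image should rank its own text highest.
    image_to_text = -sum(log_softmax(sim[i])[i] for i in range(n)) / n

    # Text-to-image direction: transpose so each row is one text query.
    sim_t = [list(col) for col in zip(*sim)]
    text_to_image = -sum(log_softmax(sim_t[j])[j] for j in range(n)) / n

    # Average the two directions to obtain the bidirectional loss.
    return 0.5 * (image_to_text + text_to_image)


if __name__ == "__main__":
    images = [[1.0, 0.0], [0.0, 1.0]]
    matching_texts = [[1.0, 0.0], [0.0, 1.0]]
    swapped_texts = [[0.0, 1.0], [1.0, 0.0]]
    print(bidirectional_contrastive_loss(images, matching_texts))
    print(bidirectional_contrastive_loss(images, swapped_texts))
```

In this toy batch the correctly paired texts yield a much smaller loss than the swapped ones, which is the behaviour the training objective relies on: matching pairs are pulled together and mismatched pairs pushed apart in both retrieval directions.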