UniTR: A Unified TRansformer-Based Framework for Co-Object and Multi-Modal Saliency Detection

Bibliographic Details
Published in: IEEE Transactions on Multimedia, Vol. 26, pp. 7622–7635
Main Authors: Guo, Ruohao; Ying, Xianghua; Qi, Yanyu; Qu, Liao
Format: Journal Article
Language: English
Published: Piscataway: IEEE (The Institute of Electrical and Electronics Engineers, Inc.), 2024
Description
Summary: Recent years have witnessed growing interest in co-object segmentation and multi-modal salient object detection. Many efforts are devoted to segmenting co-existing objects among a group of images or detecting salient objects from different modalities. Despite the appreciable performance achieved on their respective benchmarks, each of these methods is limited to a specific task and cannot be generalized to other tasks. In this paper, we develop a Unified TRansformer-based framework, namely UniTR, aiming to tackle the above tasks individually with a unified architecture. Specifically, a transformer module (CoFormer) is introduced to learn the consistency of relevant objects or the complementarity of different modalities. To generate high-quality segmentation maps, we adopt a dual-stream decoding paradigm that allows the extracted consistent or complementary information to better guide mask prediction. Moreover, a feature fusion module (ZoomFormer) is designed to enhance backbone features and capture multi-granularity, multi-semantic information. Extensive experiments show that UniTR performs well on 17 benchmarks and surpasses existing state-of-the-art approaches.
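The abstract describes CoFormer only at a high level: a transformer module in which features from one image (or modality) attend to features from another, so that shared (consistent) or complementary content is aggregated. As a rough illustration of that idea, not the authors' implementation, the sketch below implements single-head cross-attention in NumPy; the token counts, dimensions, and random projections standing in for learned weights are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query_feats, context_feats, d_k=64, seed=0):
    """Single-head cross-attention: tokens of one image/modality (queries)
    attend to tokens of another (keys/values), aggregating consistent or
    complementary content into the query stream.

    Random projections are used here as placeholders for learned weights.
    """
    rng = np.random.default_rng(seed)
    d = query_feats.shape[-1]
    W_q = rng.standard_normal((d, d_k)) / np.sqrt(d)
    W_k = rng.standard_normal((d, d_k)) / np.sqrt(d)
    W_v = rng.standard_normal((d, d_k)) / np.sqrt(d)
    Q = query_feats @ W_q            # (N_q, d_k)
    K = context_feats @ W_k          # (N_c, d_k)
    V = context_feats @ W_v          # (N_c, d_k)
    attn = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)  # (N_q, N_c), rows sum to 1
    return attn @ V                  # (N_q, d_k)

# Toy flattened feature maps: 16 query tokens and 20 context tokens, 32-dim.
q = np.random.default_rng(1).standard_normal((16, 32))
c = np.random.default_rng(2).standard_normal((20, 32))
out = cross_attention(q, c)
print(out.shape)  # → (16, 64)
```

In a real co-object setting the same mechanism would run over backbone features of every image in the group, and the attended output would feed the mask-prediction decoder described in the abstract.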
ISSN:1520-9210
1941-0077
DOI:10.1109/TMM.2024.3369922