Semantic Refocused Tuning for Open-Vocabulary Panoptic Segmentation
Main Authors: , , , , ,
Format: Journal Article
Language: English
Published: 24-09-2024
Online Access: Get full text
Summary: Open-vocabulary panoptic segmentation is an emerging task that aims to accurately segment an image into semantically meaningful masks based on a set of text descriptions. Despite existing efforts, it remains challenging to develop a high-performing method that both generalizes effectively to new domains and requires minimal training resources. Our in-depth analysis of current methods reveals a crucial insight: mask classification is the main performance bottleneck for open-vocabulary panoptic segmentation. Based on this, we propose Semantic Refocused Tuning (SMART), a novel framework that greatly enhances open-vocabulary panoptic segmentation by improving mask classification through two key innovations. First, SMART adopts a multimodal Semantic-guided Mask Attention mechanism that injects task awareness into the regional information extraction process, enabling the model to capture task-specific and contextually relevant information for more effective mask classification. Second, it incorporates Query Projection Tuning, which strategically fine-tunes the query projection layers within the Vision-Language Model (VLM) used for mask classification. This adjustment allows the model to adapt the image focus of mask tokens to new distributions with minimal training resources while preserving the VLM's pre-trained knowledge. Extensive ablation studies confirm the superiority of our approach. Notably, SMART sets new state-of-the-art results, with improvements of up to +1.3 PQ and +5.4 mIoU on representative benchmarks, while reducing training costs by nearly 10x compared to the previous best method. Our code and data will be released.
DOI: 10.48550/arxiv.2409.16278
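
To make the first idea in the summary more concrete, below is a minimal, hypothetical sketch of what semantic-guided mask pooling could look like. The function name, tensor shapes, and the use of text-similarity scores as attention weights are illustrative assumptions, not the paper's actual formulation.

```python
import torch
import torch.nn.functional as F


def semantic_guided_mask_attention(patch_feats, masks, text_embeds, tau=0.07):
    """Hypothetical sketch: pool a region embedding for each mask, weighting
    patches by their similarity to the class-name text embeddings so the
    pooled feature focuses on task-relevant content (masks assumed non-empty).

    patch_feats: (N, D) VLM patch features
    masks:       (M, N) binary masks over the N patches
    text_embeds: (C, D) text embeddings of the candidate class names
    returns:     (M, D) one region embedding per mask
    """
    patch_feats = F.normalize(patch_feats, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    # Task-aware relevance of each patch: best cosine similarity to any class.
    relevance = (patch_feats @ text_embeds.t()).max(dim=-1).values    # (N,)

    # Attend only inside each mask; softmax the relevance scores per mask.
    logits = relevance.unsqueeze(0).expand(masks.shape[0], -1) / tau  # (M, N)
    logits = logits.masked_fill(masks <= 0, float("-inf"))
    attn = logits.softmax(dim=-1)                                     # (M, N)

    return attn @ patch_feats                                         # (M, D)
```

In this sketch, each mask would then be classified by comparing its pooled embedding against the same text embeddings, which is the standard open-vocabulary mask-classification setup the summary refers to.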
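The second idea, tuning only the query projection layers of the VLM, can be sketched in a few lines. The "q_proj" substring used to pick out query-projection weights is an assumed naming convention and would need to match the actual VLM implementation.

```python
import torch


def enable_query_projection_tuning(vlm: torch.nn.Module) -> None:
    """Hypothetical sketch: freeze the whole VLM and re-enable gradients only
    for parameters whose names mark them as query projections, so training
    updates a small slice of the model while the rest keeps its pre-trained
    knowledge."""
    for name, param in vlm.named_parameters():
        param.requires_grad = "q_proj" in name  # assumed naming convention

    trainable = sum(p.numel() for p in vlm.parameters() if p.requires_grad)
    total = sum(p.numel() for p in vlm.parameters())
    print(f"trainable parameters: {trainable:,} / {total:,} "
          f"({trainable / total:.2%})")
```

The design intent, as described in the summary, is that restricting updates to the query projections shifts where the mask tokens attend in the image without overwriting the VLM's pre-trained knowledge, which is what keeps the training cost low.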