MROVSeg: Breaking the Resolution Curse of Vision-Language Models in Open-Vocabulary Semantic Segmentation
Format: Journal Article
Language: English
Published: 27-08-2024
Summary: Open-vocabulary semantic segmentation aims to segment and recognize semantically meaningful regions based on text descriptions at inference time. A typical solution is to leverage powerful vision-language models (VLMs), such as CLIP, to bridge the gap between open- and closed-vocabulary recognition. Because VLMs are usually pretrained on low-resolution images (e.g. $224\times224$), most previous methods operate only on downscaled images. We question this design, as low-resolution features often fail to preserve fine details. Although employing an additional image backbone for high-resolution inputs can mitigate this issue, it may also introduce significant computational overhead. We therefore propose MROVSeg, a multi-resolution training framework for open-vocabulary semantic segmentation with a single pretrained CLIP backbone, which uses sliding windows to slice the high-resolution input into uniform patches, each matching the input size of the well-trained image encoder. Its key components include a Multi-Res Adapter, which restores the spatial geometry and captures local-global correspondences across patches through learnable convolutional and scale-attention layers. To achieve accurate segmentation, we introduce a Multi-grained Masked Attention scheme that aggregates multi-grained semantics by performing cross-attention between object queries and multi-resolution CLIP features within the regions of interest. Through comprehensive experiments, we demonstrate the superiority of MROVSeg on well-established open-vocabulary semantic segmentation benchmarks, particularly for high-resolution inputs, establishing new standards for open-vocabulary semantic segmentation.
DOI: 10.48550/arxiv.2408.14776
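
The summary describes three mechanisms: slicing a high-resolution input into windows that match the pretrained CLIP encoder's $224\times224$ input size, restoring the windows' spatial geometry after per-window encoding, and restricting the cross-attention between object queries and CLIP features to each query's region of interest. The PyTorch sketch below illustrates those steps under stated assumptions; the function names, tensor shapes, and the empty-region fallback are illustrative choices, not the paper's implementation, and the learnable convolutional and scale-attention layers of the Multi-Res Adapter are omitted.

```python
import math
import torch
import torch.nn.functional as F

def slice_into_windows(image: torch.Tensor, window: int = 224) -> torch.Tensor:
    """Slice a (C, H, W) image into non-overlapping window x window patches.

    H and W are padded up to multiples of `window` so every patch matches the
    pretrained image encoder's input size. Returns (num_h, num_w, C, window, window).
    """
    c, h, w = image.shape
    pad_h = (window - h % window) % window
    pad_w = (window - w % window) % window
    image = F.pad(image, (0, pad_w, 0, pad_h))           # pad right and bottom edges
    num_h, num_w = image.shape[1] // window, image.shape[2] // window
    patches = image.reshape(c, num_h, window, num_w, window)
    return patches.permute(1, 3, 0, 2, 4)                # (num_h, num_w, C, win, win)

def reassemble_grid(window_feats: torch.Tensor, num_h: int, num_w: int) -> torch.Tensor:
    """Stitch per-window feature maps (num_h * num_w, D, g, g) back into one
    (D, num_h * g, num_w * g) map, restoring the windows' spatial layout."""
    _, d, g, _ = window_feats.shape
    feats = window_feats.reshape(num_h, num_w, d, g, g)
    return feats.permute(2, 0, 3, 1, 4).reshape(d, num_h * g, num_w * g)

def masked_cross_attention(queries: torch.Tensor, feats: torch.Tensor,
                           region_mask: torch.Tensor) -> torch.Tensor:
    """Cross-attention from object queries (Q, D) to flattened features (N, D),
    restricted by a boolean mask (Q, N) where True marks the region of interest."""
    # Fall back to attending everywhere when a query's region is empty, to avoid NaNs.
    region_mask = region_mask | ~region_mask.any(dim=-1, keepdim=True)
    scores = queries @ feats.T / math.sqrt(queries.shape[-1])
    scores = scores.masked_fill(~region_mask, float("-inf"))
    return scores.softmax(dim=-1) @ feats

# Toy usage: an 896x672 image yields a 4x3 grid of 224x224 windows.
image = torch.rand(3, 896, 672)
patches = slice_into_windows(image)                      # (4, 3, 3, 224, 224)
fake_feats = torch.rand(4 * 3, 512, 14, 14)              # stand-in for per-window CLIP features
grid = reassemble_grid(fake_feats, num_h=4, num_w=3)     # (512, 56, 42)
queries = torch.rand(16, 512)
mask = torch.rand(16, grid.shape[1] * grid.shape[2]) > 0.5
out = masked_cross_attention(queries, grid.flatten(1).T, mask)   # (16, 512)
```

In MROVSeg itself, as the summary indicates, the per-window features would come from the shared CLIP image encoder, the reassembled map would pass through the Multi-Res Adapter, and the masked attention would be applied across multiple feature granularities rather than the single scale shown here.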