From Unimodal to Multimodal: Scaling up Projectors to Align Modalities
Main Authors:
Format: Journal Article
Language: English
Published: 28-09-2024
Summary: Recent contrastive multimodal vision-language models like CLIP have demonstrated robust open-world semantic understanding, becoming the standard image backbones for vision-language applications due to their aligned latent space. However, this practice has left powerful unimodal encoders for both vision and language underutilized in multimodal applications, which raises a key question: Is there a plausible way to connect unimodal backbones for zero-shot vision-language tasks? To this end, we propose a novel approach that aligns vision and language modalities using only projection layers on pretrained, frozen unimodal encoders. Our method exploits the high semantic similarity between the embedding spaces of well-trained vision and language models. It involves selecting semantically similar encoders in the latent space, curating a concept-rich dataset of image-caption pairs, and training simple MLP projectors. We evaluated our approach on 12 zero-shot classification datasets and 2 image-text retrieval datasets. Our best model, utilizing DINOv2 and the All-Roberta-Large text encoder, achieves 76% accuracy on ImageNet with a 20-fold reduction in data and a 65-fold reduction in compute requirements. The proposed framework enhances the accessibility of model development while enabling flexible adaptation across diverse scenarios, offering an efficient approach to building multimodal models by utilizing existing unimodal architectures. Code and datasets will be released soon.
DOI: 10.48550/arxiv.2409.19425
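
The abstract outlines the recipe at a high level: frozen unimodal encoders, a concept-rich image-caption dataset, and simple MLP projectors trained to align the two embedding spaces. As a rough illustration of that idea (not the authors' released code), the PyTorch sketch below trains two MLP projectors on precomputed, frozen image and text features using a CLIP-style symmetric contrastive loss. The loss choice, projector architecture, embedding dimensions, and hyperparameters are all assumptions made for the example.

```python
# Illustrative sketch: align frozen unimodal embeddings (stand-ins for DINOv2
# image features and All-Roberta-Large sentence features) with small MLP
# projectors. Loss, dimensions, and hyperparameters are assumptions, not
# values taken from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MLPProjector(nn.Module):
    """Simple MLP mapping a frozen encoder's embedding into a shared space."""

    def __init__(self, in_dim: int, out_dim: int, hidden_dim: int = 2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Unit-normalize so dot products act as cosine similarities.
        return F.normalize(self.net(x), dim=-1)


def contrastive_loss(img_z: torch.Tensor, txt_z: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """CLIP-style symmetric cross-entropy over image-text similarity logits
    (an assumed training objective for this sketch)."""
    logits = img_z @ txt_z.t() / temperature
    targets = torch.arange(img_z.size(0), device=img_z.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


# Placeholder batch of precomputed, frozen embeddings; in practice these would
# come from the chosen vision and text encoders run once over image-caption pairs.
image_feats = torch.randn(32, 1024)  # e.g. DINOv2 features (dimension assumed)
text_feats = torch.randn(32, 1024)   # e.g. All-Roberta-Large features (dimension assumed)

img_proj = MLPProjector(in_dim=1024, out_dim=512)
txt_proj = MLPProjector(in_dim=1024, out_dim=512)
optimizer = torch.optim.AdamW(
    list(img_proj.parameters()) + list(txt_proj.parameters()), lr=1e-4
)

# One training step: only the projectors receive gradients; the encoders stay
# frozen because training operates on their precomputed features.
loss = contrastive_loss(img_proj(image_feats), txt_proj(text_feats))
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(f"projector alignment loss: {loss.item():.4f}")
```

Zero-shot classification would then presumably follow the usual CLIP-style protocol: embed class-name prompts with the text encoder and its projector, and assign each projected image embedding to the nearest class embedding by cosine similarity.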