Do Vision and Language Encoders Represent the World Similarly?

Bibliographic Details
Published in:	2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) pp. 14334 - 14343
Main Authors:	Maniparambil, Mayug; Akshulakov, Raiymbek; Dahou Djilali, Yasser Abdelaziz; Seddik, Mohamed El Amine; Narayan, Sanath; Mangalam, Karttikeya; O'Connor, Noel E.
Format:	Conference Proceeding
Language:	English
Published:	IEEE 16-06-2024
Subjects:	CLIP; Codes; Computer vision; Kernel; Measurement; Semantics; Space communications; Training; Unified Representations; Vision Language; Zero-shot

Description
Summary:	Aligned text-image encoders such as CLIP have become the de facto model for vision-language tasks. Furthermore, modality-specific encoders achieve impressive performance in their respective domains. This raises a central question: does an alignment exist between uni-modal vision and language encoders, since they fundamentally represent the same physical world? Analyzing the latent space structure of vision and language models on image-caption benchmarks using Centered Kernel Alignment (CKA), we find that the representation spaces of unaligned and aligned encoders are semantically similar. In the absence of statistical similarity in aligned encoders like CLIP, we show that a possible matching of unaligned encoders exists without any training. We frame this as a seeded graph-matching problem exploiting the semantic similarity between graphs and propose two methods: a Fast Quadratic Assignment Problem optimization and a novel localized CKA metric-based matching/retrieval. We demonstrate the effectiveness of this on several downstream tasks, including cross-lingual and cross-domain caption matching and image classification. Code available at github.com/mayug/0-shot-llm-vision.
ISSN:	2575-7075
DOI:	10.1109/CVPR52733.2024.01359
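
Note:	The summary relies on Centered Kernel Alignment (CKA) to compare the representation spaces of a vision encoder and a language encoder on paired image-caption data. The following is a minimal sketch of the standard linear-kernel form of CKA, not the authors' implementation: the function name linear_cka, the synthetic embeddings, and the choice of a linear kernel are all assumptions made for illustration, and the localized CKA variant proposed in the paper is not reproduced here.

```python
import numpy as np


def linear_cka(x, y):
    """Linear CKA between two embedding matrices with matching rows.

    x has shape (n, d1) and y has shape (n, d2); row i of each matrix is
    assumed to describe the same item (for example an image and its caption).
    Hypothetical helper for illustration only.
    """
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    if x.shape[0] != y.shape[0]:
        raise ValueError("x and y must have the same number of rows")

    # Center every feature so the score ignores constant offsets.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)

    # Squared Frobenius norm of the cross-covariance, normalized by the
    # Frobenius norms of the two self-covariances.
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


# Synthetic stand-ins for encoder outputs (assumption: no real model is used).
rng = np.random.default_rng(0)
image_emb = rng.normal(size=(200, 32))

# A random orthogonal map plays the role of a second encoder whose space has
# the same geometry but different coordinates.
rotation, _ = np.linalg.qr(rng.normal(size=(32, 32)))
text_emb_paired = image_emb @ rotation

# Shuffling the rows breaks the image-caption pairing.
text_emb_shuffled = text_emb_paired[rng.permutation(200)]

score_paired = linear_cka(image_emb, text_emb_paired)
score_shuffled = linear_cka(image_emb, text_emb_shuffled)
print(f"paired: {score_paired:.3f}, shuffled: {score_shuffled:.3f}")
```

In this sketch the correctly paired embeddings score close to 1 even though their coordinates differ, while the shuffled pairing scores much lower. A score that depends on the pairing in this way is the kind of signal a matching procedure could try to maximize, which is one way to read the graph-matching framing in the summary.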