Bootstrapping SparseFormers from Vision Foundation Models
Format: Journal Article
Language: English
Published: 04-12-2023
Summary: The recently proposed SparseFormer architecture offers an alternative approach to visual understanding: it uses a significantly smaller number of visual tokens with adjustable RoIs, greatly reducing computational cost while still achieving promising performance. However, training SparseFormers from scratch is still expensive, and scaling up the number of parameters can be challenging. In this paper, we propose a simple and efficient way to bootstrap SparseFormers from ViT-based vision foundation models. Since the majority of SparseFormer blocks are standard transformer blocks, we can inherit weights from large-scale pre-trained vision transformers and freeze them as much as possible. We therefore only need to train the SparseFormer-specific lightweight focusing transformer, which adjusts token RoIs, and to fine-tune a few early pre-trained blocks to align the final token representation. In this way, we can bootstrap SparseFormer architectures from various large-scale pre-trained models (e.g., IN-21K pre-trained AugRegs or CLIPs) with a comparatively small number of training samples (e.g., IN-1K), without labels or captions, in just a few hours. As a result, the bootstrapped unimodal SparseFormer (from AugReg-ViT-L/16-384) reaches 84.9% accuracy on IN-1K with only 49 tokens, and the multimodal SparseFormer bootstrapped from CLIP also shows notable zero-shot performance at a greatly reduced computational cost, despite never seeing a caption during bootstrapping. In addition, CLIP-bootstrapped SparseFormers, which align their output space with language without seeing a word of text, can serve as efficient vision encoders in multimodal large language models. Code and models are available at https://github.com/showlab/sparseformer
DOI: 10.48550/arxiv.2312.01987
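To make the freezing scheme described in the summary concrete, below is a minimal PyTorch-style sketch of the bootstrapping recipe: inherit a pre-trained ViT, freeze most of its blocks, train the lightweight focusing transformer from scratch, and fine-tune a few early blocks. This is an illustration under assumptions, not the authors' implementation: the `timm` model name, the `FocusingTransformer` class, the `bootstrap_sparseformer` helper, and the choice of two trainable early blocks are hypothetical placeholders; the actual code lives in the linked repository.

```python
# Sketch of the bootstrapping recipe: freeze inherited ViT blocks, train the
# SparseFormer-specific focusing transformer, fine-tune a few early blocks.
# Module and function names here are illustrative, not the authors' API.

import torch
import torch.nn as nn
import timm  # assumed available for loading a pre-trained ViT


class FocusingTransformer(nn.Module):
    """Hypothetical lightweight module that refines tokens and adjusts their RoIs."""

    def __init__(self, dim: int, num_tokens: int = 49):
        super().__init__()
        # Initial RoIs, one (cx, cy, w, h) box in [0, 1] per latent token.
        self.init_rois = nn.Parameter(torch.rand(num_tokens, 4))
        self.refine = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.to_delta = nn.Linear(dim, 4)

    def forward(self, tokens: torch.Tensor):
        # Refine token embeddings and predict per-token RoI adjustments.
        # In the actual architecture the adjusted RoIs drive sparse feature
        # sampling from the image; that sampling step is omitted in this sketch.
        tokens = self.refine(tokens)
        rois = self.init_rois + self.to_delta(tokens)
        return tokens, rois


def bootstrap_sparseformer(num_trainable_early_blocks: int = 2) -> nn.Module:
    # Inherit weights from a large-scale pre-trained ViT (e.g., an AugReg ViT-L/16).
    vit = timm.create_model("vit_large_patch16_384", pretrained=True)

    # Freeze everything inherited from the foundation model...
    for p in vit.parameters():
        p.requires_grad = False

    # ...except a few early blocks, which are fine-tuned so the final token
    # representation stays aligned with the sparse-token input.
    for block in vit.blocks[:num_trainable_early_blocks]:
        for p in block.parameters():
            p.requires_grad = True

    # The SparseFormer-specific focusing transformer is trained from scratch.
    focusing = FocusingTransformer(dim=vit.embed_dim)

    return nn.ModuleDict({"focusing": focusing, "backbone": vit})


if __name__ == "__main__":
    model = bootstrap_sparseformer()
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"trainable params: {trainable / 1e6:.1f}M of {total / 1e6:.1f}M")
```

The point of the sketch is the parameter split: only the focusing transformer and the first few inherited blocks receive gradients, which is why bootstrapping can finish within a few hours on IN-1K-scale data.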