An Efficient Convolutional Multi-Scale Vision Transformer for Image Classification


Bibliographic Details
Published in: 2023 International Conference on Image Processing, Computer Vision and Machine Learning (ICICML), pp. 344-347
Main Authors: Zhang, Ji; Chen, Zhihao; Ge, Yiyuan; Yu, Mingxin
Format: Conference Proceeding
Language: English
Published: IEEE 03-11-2023
Description
Summary: This paper introduces an innovative and efficient multi-scale Vision Transformer (ViT) for the task of image classification. The proposed model leverages the inherent power of the transformer architecture and combines it with the concept of multi-scale processing generally used in convolutional neural networks (CNNs). The work aims to address the limitations of conventional ViTs, which typically operate at a single scale and hence overlook the hierarchical structure of visual data. The multi-scale ViT enhances classification performance by processing image features at different scales, effectively capturing both low-level and high-level semantic information. Extensive experimental results demonstrate the superior performance of the proposed model over standard ViTs and other state-of-the-art image classification methods, signifying the effectiveness of the multi-scale approach. This research opens new avenues for incorporating scale variance in transformer-based models for improved performance in vision tasks.
DOI:10.1109/ICICML60161.2023.10424909
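Note: the record above does not include implementation details, so the following is only a minimal illustrative sketch of the general idea the abstract describes: convolutional patch embeddings computed at several scales whose token sequences are fed to a standard transformer encoder. All class names, patch sizes, and hyperparameters below are assumptions for illustration, not the authors' architecture.

```python
import torch
import torch.nn as nn


class MultiScalePatchEmbed(nn.Module):
    # Hypothetical convolutional patch embedding at several scales; the token
    # sequences from all scales are concatenated along the sequence dimension.
    def __init__(self, in_chans=3, embed_dim=192, patch_sizes=(4, 8, 16)):
        super().__init__()
        self.projs = nn.ModuleList(
            [nn.Conv2d(in_chans, embed_dim, kernel_size=p, stride=p)
             for p in patch_sizes]
        )

    def forward(self, x):
        tokens = []
        for proj in self.projs:
            t = proj(x)                                  # (B, C, H/p, W/p)
            tokens.append(t.flatten(2).transpose(1, 2))  # (B, N_p, C)
        return torch.cat(tokens, dim=1)                  # (B, sum(N_p), C)


class MultiScaleViT(nn.Module):
    # Multi-scale tokens followed by a plain transformer encoder and a
    # mean-pooled classification head (positional encodings omitted for brevity).
    def __init__(self, num_classes=1000, embed_dim=192, depth=4, num_heads=3):
        super().__init__()
        self.embed = MultiScalePatchEmbed(embed_dim=embed_dim)
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads,
            dim_feedforward=4 * embed_dim, activation="gelu",
            batch_first=True,
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.norm = nn.LayerNorm(embed_dim)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        tokens = self.embed(x)         # multi-scale token sequence
        tokens = self.encoder(tokens)
        pooled = tokens.mean(dim=1)    # mean-pool instead of a [CLS] token
        return self.head(self.norm(pooled))


if __name__ == "__main__":
    model = MultiScaleViT(num_classes=10)
    logits = model(torch.randn(1, 3, 224, 224))
    print(logits.shape)  # torch.Size([1, 10])
```

In this sketch, capturing low-level and high-level information is approximated simply by mixing tokens from fine and coarse patch sizes in one encoder; the published model may instead use separate branches, cross-scale attention, or other mechanisms not described in this record.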