An Efficient Convolutional Multi-Scale Vision Transformer for Image Classification
Published in: 2023 International Conference on Image Processing, Computer Vision and Machine Learning (ICICML), pp. 344-347
Main Authors: , , ,
Format: Conference Proceeding
Language: English
Published: IEEE, 03-11-2023
Summary: This paper introduces an innovative and efficient multi-scale Vision Transformer (ViT) for image classification. The proposed model combines the power of the transformer architecture with the multi-scale processing commonly used in convolutional neural networks (CNNs). The work addresses a limitation of conventional ViTs, which typically operate at a single scale and thus overlook the hierarchical structure of visual data. The multi-scale ViT improves classification performance by processing image features at different scales, effectively capturing both low-level detail and high-level semantic information. Extensive experimental results demonstrate that the proposed model outperforms standard ViTs and other state-of-the-art image classification methods, confirming the effectiveness of the multi-scale approach. This research opens new avenues for incorporating scale variance in transformer-based models to improve performance on vision tasks.
DOI: 10.1109/ICICML60161.2023.10424909
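The abstract describes tokenizing an image at multiple scales and feeding the combined tokens to a transformer. The paper's actual architecture is not given here, so the following is only a minimal NumPy sketch of that general idea (all names, patch sizes, and dimensions are assumptions, not the authors' implementation): patchify an image at a fine and a coarse scale, project both token sets into a shared embedding space, run one self-attention layer over the combined sequence, and classify from the pooled output.

```python
# Hypothetical sketch of a multi-scale ViT-style pipeline; every size and
# weight here is an illustrative assumption, not the paper's configuration.
import numpy as np

rng = np.random.default_rng(0)

def patchify(img, p):
    """Split an (H, W, C) image into flattened non-overlapping p x p patches."""
    h, w, c = img.shape
    patches = img.reshape(h // p, p, w // p, p, c).transpose(0, 2, 1, 3, 4)
    return patches.reshape(-1, p * p * c)  # (num_patches, p*p*c)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, wq, wk, wv):
    """Single-head scaled dot-product self-attention over the token sequence."""
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    scores = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return scores @ v

d, num_classes = 32, 10
img = rng.standard_normal((32, 32, 3))

# Two scales: fine 4x4 patches (low-level detail) and coarse 8x8 patches
# (higher-level context), each with its own linear patch embedding.
fine, coarse = patchify(img, 4), patchify(img, 8)
w_fine = rng.standard_normal((fine.shape[1], d)) * 0.02
w_coarse = rng.standard_normal((coarse.shape[1], d)) * 0.02
tokens = np.concatenate([fine @ w_fine, coarse @ w_coarse])  # (64 + 16, d)

# One attention layer over the combined multi-scale token sequence.
wq, wk, wv = (rng.standard_normal((d, d)) * 0.02 for _ in range(3))
attended = self_attention(tokens, wq, wk, wv)

# Mean-pool the tokens and apply a linear classification head.
w_head = rng.standard_normal((d, num_classes)) * 0.02
logits = attended.mean(axis=0) @ w_head
print(logits.shape)  # (10,)
```

Because fine and coarse tokens attend to each other in the same sequence, each output token mixes detail and context; a real model would stack many such layers, add positional embeddings and normalization, and train the projections end to end.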