P2T: Pyramid Pooling Transformer for Scene Understanding

Bibliographic Details
Published in: IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 45, No. 11, pp. 12760-12771
Main Authors: Wu, Yu-Huan; Liu, Yun; Zhan, Xin; Cheng, Ming-Ming
Format: Journal Article
Language: English
Published: New York: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01-11-2023
Description
Summary: Recently, the vision transformer has achieved great success by pushing the state of the art on various vision tasks. One of the most challenging problems in vision transformers is that the large sequence length of image tokens leads to high computational cost (quadratic complexity). A popular solution to this problem is to use a single pooling operation to reduce the sequence length. This paper considers how to improve existing vision transformers, where the pooled feature extracted by a single pooling operation seems less powerful. To this end, we note that pyramid pooling has been demonstrated to be effective in various vision tasks owing to its powerful ability in context abstraction. However, pyramid pooling has not been explored in backbone network design. To bridge this gap, we propose to adapt pyramid pooling to Multi-Head Self-Attention (MHSA) in the vision transformer, simultaneously reducing the sequence length and capturing powerful contextual features. Building on our pooling-based MHSA, we construct a universal vision transformer backbone, dubbed Pyramid Pooling Transformer (P2T). Extensive experiments demonstrate that, when P2T is applied as the backbone network, it shows substantial superiority in various vision tasks such as image classification, semantic segmentation, object detection, and instance segmentation, compared to previous CNN- and transformer-based networks. The code will be released at https://github.com/yuhuan-wu/P2T .
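
As a rough illustration of the mechanism the summary describes, the PyTorch sketch below shows how a pooling-based MHSA layer might fuse pyramid pooling into attention: queries come from the full token sequence, while keys and values come from tokens pooled at several ratios and concatenated, which simultaneously shortens the attended sequence and injects multi-scale context. The class name PyramidPoolingAttention, the pool ratios, and the layer layout are illustrative assumptions, not the authors' exact implementation; for that, see the linked repository.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class PyramidPoolingAttention(nn.Module):
        # Illustrative sketch of pooling-based MHSA with pyramid pooling.
        # Names, pool ratios, and layer layout are assumptions, not the
        # paper's exact configuration.
        def __init__(self, dim, num_heads=8, pool_ratios=(12, 16, 20, 24)):
            super().__init__()
            self.num_heads = num_heads
            self.head_dim = dim // num_heads
            self.pool_ratios = pool_ratios
            self.q = nn.Linear(dim, dim)       # queries from all N tokens
            self.kv = nn.Linear(dim, dim * 2)  # keys/values from pooled tokens
            self.proj = nn.Linear(dim, dim)
            self.norm = nn.LayerNorm(dim)

        def forward(self, x, H, W):
            # x: (B, N, C) token sequence with N = H * W
            B, N, C = x.shape
            q = self.q(x).reshape(B, N, self.num_heads, self.head_dim).transpose(1, 2)

            # Pyramid pooling: average-pool the 2D feature map at several
            # output sizes, then flatten each result back into a short
            # token sequence.
            feat = x.transpose(1, 2).reshape(B, C, H, W)
            pooled = [
                F.adaptive_avg_pool2d(feat, (max(H // r, 1), max(W // r, 1)))
                 .flatten(2).transpose(1, 2)   # (B, n_r, C)
                for r in self.pool_ratios
            ]
            pooled = self.norm(torch.cat(pooled, dim=1))  # (B, M, C), M << N

            kv = self.kv(pooled).reshape(B, -1, 2, self.num_heads, self.head_dim)
            k, v = kv.permute(2, 0, 3, 1, 4)   # each: (B, heads, M, head_dim)

            # Attention cost is O(N * M) instead of O(N^2).
            attn = (q @ k.transpose(-2, -1)) * self.head_dim ** -0.5
            out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(B, N, C)
            return self.proj(out)

    # Example: a 56x56 feature map with 64 channels attends over ~33 pooled
    # tokens rather than all 3136, while still mixing multi-scale context.
    layer = PyramidPoolingAttention(dim=64)
    y = layer(torch.randn(2, 56 * 56, 64), H=56, W=56)  # -> (2, 3136, 64)

This sketch only captures the core idea of attending to a pyramid-pooled key/value sequence; a faithful implementation would follow the configuration in the repository linked above.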
ISSN: 0162-8828; 2160-9292; 1939-3539
DOI: 10.1109/TPAMI.2022.3202765