SipMaskv2: Enhanced Fast Image and Video Instance Segmentation

Bibliographic Details
Published in:	IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 45, no. 3, pp. 3798-3812
Main Authors:	Cao, Jiale; Pang, Yanwei; Anwer, Rao Muhammad; Cholakkal, Hisham; Khan, Fahad Shahbaz; Shao, Ling
Format:	Journal Article
Language:	English
Published:	New York: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01-03-2023
Subjects:	Alignment; Datasets; Feature extraction; Image enhancement; Image instance segmentation; Image segmentation; Instance segmentation; Modules; Object detection; Object recognition; Proposals; Real-time; Real-time systems; Single-stage method; Source code; Spatial data; Spatial information preservation; Task analysis; Training; Video instance segmentation; Weight reduction

Description
Summary:	We propose a fast single-stage method for both image and video instance segmentation, called SipMask, that preserves instance spatial information by performing multiple sub-region mask predictions. The main module in our method is a light-weight spatial preservation (SP) module that generates a separate set of spatial coefficients for the sub-regions within a bounding box, enabling better delineation of spatially adjacent instances. To better correlate mask prediction with object detection, we further propose a mask alignment weighting loss and a feature alignment scheme. In addition, we identify two issues that impede the performance of single-stage instance segmentation and introduce two modules, a sample selection scheme and an instance refinement module, to address them. Experiments are performed on both the image instance segmentation dataset MS COCO and the video instance segmentation dataset YouTube-VIS. On the MS COCO test-dev set, our method achieves state-of-the-art performance. In terms of real-time capabilities, it outperforms YOLACT by 3.0% (mask AP) under similar settings while operating at a comparable speed. On the YouTube-VIS validation set, our method also achieves promising results. The source code is available at https://github.com/JialeCao001/SipMask.
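The sub-region idea in the summary can be illustrated with a rough sketch. This is not the authors' implementation: the shared basis maps, the 2x2 split at the box midpoint, and the name `assemble_mask` are all assumptions made for illustration. The only point taken from the summary is that each sub-region of the bounding box gets its own set of coefficients.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def assemble_mask(basis, coeffs, box):
    """Combine basis maps using a separate coefficient set per box sub-region.

    basis:  K maps, each H x W (nested lists); hypothetical shared features.
    coeffs: dict keyed by (row, col) in a 2x2 grid -> list of K coefficients.
    box:    (x0, y0, x1, y1), half-open pixel bounds of the bounding box.
    """
    x0, y0, x1, y1 = box
    height, width = len(basis[0]), len(basis[0][0])
    # Split the box at its midpoint into a 2x2 grid (an assumed layout).
    xm, ym = (x0 + x1) // 2, (y0 + y1) // 2
    # Pixels outside the box stay at 0.0.
    mask = [[0.0] * width for _ in range(height)]
    for y in range(y0, y1):
        for x in range(x0, x1):
            # Pick the coefficient set of the sub-region this pixel falls in.
            cell = (0 if y < ym else 1, 0 if x < xm else 1)
            logit = sum(c * b[y][x] for c, b in zip(coeffs[cell], basis))
            mask[y][x] = sigmoid(logit)
    return mask
```

Because each quadrant is scored with its own coefficients, two pixels with identical basis responses can still receive different mask values when they fall in different sub-regions of the box, which is the property the summary credits for separating adjacent instances.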
ISSN:	0162-8828, 1939-3539, 2160-9292
DOI:	10.1109/TPAMI.2022.3180564