Spatial Feature Calibration and Temporal Fusion for Effective One-stage Video Instance Segmentation

Bibliographic Details
Published in:	2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11210-11219
Main Authors:	Li, Minghan; Li, Shuai; Li, Lida; Zhang, Lei
Format:	Conference Proceeding
Language:	English
Published:	IEEE 01-01-2021
Subjects:	Computer vision; Convolutional codes; Correlation; Motion segmentation; Redundancy; Sensitivity; Tracking

Description
Summary:	Modern one-stage video instance segmentation networks suffer from two limitations. First, convolutional features are neither aligned with anchor boxes nor with ground-truth bounding boxes, reducing the mask sensitivity to spatial location. Second, a video is directly divided into individual frames for frame-level instance segmentation, ignoring the temporal correlation between adjacent frames. To address these issues, we propose a simple yet effective one-stage video instance segmentation framework by spatial calibration and temporal fusion, namely STMask. To ensure spatial feature calibration with ground-truth bounding boxes, we first predict regressed bounding boxes around ground-truth bounding boxes, and extract features from them for frame-level instance segmentation. To further explore temporal correlation among video frames, we aggregate a temporal fusion module to infer instance masks from each frame to its adjacent frames, which helps our framework to handle challenging videos such as those with motion blur, partial occlusion and unusual object-to-camera poses. Experiments on the YouTube-VIS valid set show that the proposed STMask with ResNet-50/-101 backbone obtains 33.5% / 36.8% mask AP, while achieving 28.6 / 23.4 FPS on video instance segmentation. The code is released online at https://github.com/MinghanLi/STMask.
ISSN:	2575-7075
DOI:	10.1109/CVPR46437.2021.01106