Arbitrarily shaped scene text detection with dynamic convolution

Bibliographic Details
Published in:Pattern Recognition, Vol. 127, p. 108608
Main Authors: Cai, Ying, Liu, Yuliang, Shen, Chunhua, Jin, Lianwen, Li, Yidong, Ergu, Daji
Format: Journal Article
Language:English
Published: Elsevier Ltd 01-07-2022
Description
Summary:
•According to the detailed characteristics of each text instance, we dynamically generate convolutional kernels from multiple features for different instances. Specific attributes, such as position, scale, and center, are embedded into the convolutional kernel so that the mask prediction task using the text-instance-aware kernel focuses on the pixels belonging to that instance. This design helps improve the detection accuracy of adjacent text instances.
•We generate a separate mask prediction head for each instance in parallel. These heads predict masks on the original feature map and retain the resolution details of the text instance; it is no longer necessary to crop RoIs and force them to the same size. Our architecture overcomes the problem that a set of fixed convolutional kernels cannot adapt to all resolutions, while also preventing the information loss caused by the varying scales of instances.
•Because the text-instance-aware convolutional kernel increases the capacity of the model, we can achieve competitive results with a very compact prediction head. Multiple mask prediction heads can therefore run concurrently without significant computational overhead.
•To improve performance and accelerate training convergence, we design a text-shape sensitive position embedding that explicitly provides location information to the mask prediction head.

Arbitrarily shaped scene text detection has witnessed great development in recent years, and text detection using segmentation has been proven to be an effective approach. However, problems caused by the diverse attributes of text instances, such as shapes, scales, and presentation styles (dense or sparse), persist. In this paper, we propose a novel text detector, termed DText, which effectively formulates arbitrarily shaped scene text detection based on dynamic convolution. Our method dynamically generates independent text-instance-aware convolutional parameters for each text instance from multiple features, thus overcoming some intractable limitations of arbitrary-shape text detection, such as splitting similar adjacent text, which poses challenges to methods based on fixed, instance-shared convolutional parameters. Unlike standard segmentation methods relying on region-of-interest bounding boxes, DText focuses on enhancing the flexibility of the network to retain details of instances across diverse resolutions while effectively improving prediction accuracy. Moreover, we propose encoding shape and position information according to the characteristics of the text instance, termed text-shape sensitive position embedding, which provides explicit shape and position cues to the generator of the dynamic convolution parameters. Experiments on five benchmarks (Total-Text, SCUT-CTW1500, MSRA-TD500, ICDAR2015, and MLT) show that our method achieves superior detection performance.
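The record contains no code, but the core mechanism it describes, generating convolution kernels per text instance and applying them to a shared full-resolution feature map augmented with position information, follows the general pattern of dynamic (conditional) convolution. The sketch below is a minimal PyTorch illustration of that pattern under stated assumptions, not DText's actual implementation: all identifiers (DynamicMaskHead, controller, the 256-d instance embedding, the 1x1 kernel sizes) are hypothetical, and plain center-relative coordinates stand in for the paper's text-shape sensitive position embedding.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicMaskHead(nn.Module):
    """Illustrative dynamic-convolution mask head (hypothetical, not DText's code).

    A 'controller' predicts, per text instance, the weights and biases of a
    tiny conv stack. Those instance-specific kernels are then run over a
    shared feature map concatenated with center-relative coordinates, a
    crude stand-in for a text-shape sensitive position embedding.
    """

    def __init__(self, feat_channels=8, num_layers=3):
        super().__init__()
        # Channel layout per layer: (in, out) for 1x1 dynamic convs.
        # +2 on the first layer for the (x, y) relative-coordinate maps.
        self.shapes = [(feat_channels + 2, feat_channels)]
        self.shapes += [(feat_channels, feat_channels)] * (num_layers - 2)
        self.shapes += [(feat_channels, 1)]  # final layer -> 1-channel mask
        num_params = sum(ci * co + co for ci, co in self.shapes)
        # Controller maps one embedding vector per instance to all kernel params.
        self.controller = nn.Linear(256, num_params)

    def forward(self, feats, inst_embeds, inst_centers):
        """feats: (C, H, W) shared map; inst_embeds: (N, 256); inst_centers: (N, 2)."""
        _, H, W = feats.shape
        ys = torch.linspace(-1, 1, H, device=feats.device)
        xs = torch.linspace(-1, 1, W, device=feats.device)
        grid_y, grid_x = torch.meshgrid(ys, xs, indexing="ij")
        masks = []
        for embed, (cx, cy) in zip(self.controller(inst_embeds), inst_centers):
            # Coordinates relative to this instance's center make the
            # generated kernels location-aware.
            coords = torch.stack([grid_x - cx, grid_y - cy])        # (2, H, W)
            x = torch.cat([feats, coords], dim=0).unsqueeze(0)      # (1, C+2, H, W)
            # Slice the flat parameter vector into per-layer weights/biases.
            offset = 0
            for i, (ci, co) in enumerate(self.shapes):
                w = embed[offset:offset + ci * co].view(co, ci, 1, 1)
                offset += ci * co
                b = embed[offset:offset + co]
                offset += co
                x = F.conv2d(x, w, b)
                if i < len(self.shapes) - 1:
                    x = F.relu(x)
            masks.append(x.sigmoid().squeeze(0).squeeze(0))         # (H, W)
        return torch.stack(masks)  # (N, H, W): full resolution, no RoI cropping

# Toy usage: 3 instances sharing one 8-channel, 64x64 feature map.
feats = torch.randn(8, 64, 64)
embeds = torch.randn(3, 256)
centers = torch.tensor([[-0.5, 0.0], [0.0, 0.2], [0.6, -0.4]])
print(DynamicMaskHead()(feats, embeds, centers).shape)  # torch.Size([3, 64, 64])

Because each kernel set is conditioned on its own instance's embedding and relative position, two adjacent instances with similar appearance receive different masks even though they share the same feature map, which is the behavior the summary attributes to the text-instance-aware kernels.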
ISSN:0031-3203
1873-5142
DOI:10.1016/j.patcog.2022.108608