Pooling Acceleration in the DaVinci Architecture Using Im2col and Col2im Instructions

Bibliographic Details
Published in:	2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pp. 46-55
Main Authors:	Rohwedder, Caio S., de Carvalho, Joao P. L., Amaral, Jose Nelson, Araujo, Guido, Colmenares, Giancarlo, Wang, Kai-Ting Amy
Format:	Conference Proceeding
Language:	English
Published:	IEEE 01-06-2021
Subjects:	AI Accelerator; AI accelerators; CNN; Computer architecture; Conferences; Convolution; Distributed processing; Gradient; Layout; Maxpool; Software; TVM

Description
Summary:	Image-to-column (Im2col) and column-to-image (Col2im) are data transformations extensively used to map convolution to matrix multiplication. These transformations rearrange the inputs of convolution to avoid its strided memory access pattern, thus providing a friendlier data layout for CPUs and GPUs. In artificial intelligence (AI) accelerators, these transformations allow convolution to be computed in matrix-multiplier units. Implemented in software, however, they impose a significant overhead that must be compensated by the efficiency gains of matrix multipliers. DaVinci is an AI accelerator architecture that introduces instructions to optimize Im2col and Col2im. Another core layer of convolutional neural networks that presents a strided memory access pattern is pooling. This paper explores the specialized Im2col and Col2im instructions to accelerate pooling layers in DaVinci. An experimental evaluation reveals that the proposed pooling implementations can yield speedups of up to 5.8 times compared to a baseline that does not use these specialized instructions. The speedups follow from an improved memory layout in the inputs of pooling, as this layout leads to better utilization of the vector processing unit in DaVinci.
DOI:	10.1109/IPDPSW52791.2021.00016
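
The summary describes Im2col as a rearrangement that replaces the strided window accesses of convolution and pooling with a layout that can be processed contiguously. The following is a minimal pure-Python sketch of that idea for a single-channel 2-D input. It is a software illustration only, not the DaVinci instructions or the implementation evaluated in the paper; the function names `im2col`, `maxpool_im2col`, and `conv_im2col` are hypothetical and chosen for this example.

```python
def im2col(image, kh, kw, stride):
    """Rearrange every kh x kw window of a 2-D image into one column.

    Returns (cols, out_h, out_w), where cols has kh*kw rows and one
    column per window, with windows enumerated in row-major order.
    """
    h, w = len(image), len(image[0])
    out_h = (h - kh) // stride + 1
    out_w = (w - kw) // stride + 1
    cols = [[0] * (out_h * out_w) for _ in range(kh * kw)]
    for oy in range(out_h):
        for ox in range(out_w):
            col = oy * out_w + ox
            for ky in range(kh):
                for kx in range(kw):
                    cols[ky * kw + kx][col] = image[oy * stride + ky][ox * stride + kx]
    return cols, out_h, out_w


def maxpool_im2col(image, k, stride):
    """Max pooling as an element-wise max over the rows of the im2col matrix.

    Each row is contiguous, so the reduction walks memory with unit stride,
    which is the property a vector unit benefits from.
    """
    cols, out_h, out_w = im2col(image, k, k, stride)
    acc = list(cols[0])
    for row in cols[1:]:
        acc = [max(a, b) for a, b in zip(acc, row)]
    return [acc[r * out_w:(r + 1) * out_w] for r in range(out_h)]


def conv_im2col(image, kernel, stride):
    """Convolution (cross-correlation form) as a vector-matrix product."""
    kh, kw = len(kernel), len(kernel[0])
    cols, out_h, out_w = im2col(image, kh, kw, stride)
    flat = [v for row in kernel for v in row]  # flattened kernel, length kh*kw
    out = [sum(flat[r] * cols[r][c] for r in range(kh * kw))
           for c in range(out_h * out_w)]
    return [out[r * out_w:(r + 1) * out_w] for r in range(out_h)]


image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
pooled = maxpool_im2col(image, 2, 2)
convolved = conv_im2col(image, [[1, 0], [0, 1]], 2)
```

With a 2x2 window and stride 2, both functions produce a 2x2 output from the 4x4 input. The cost of building `cols` in software is the overhead the summary refers to; the paper's premise is that dedicated instructions remove that cost.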