C-Brain: A deep learning accelerator that tames the diversity of CNNs through adaptive data-level parallelization

Bibliographic Details
Published in: 2016 53rd ACM/EDAC/IEEE Design Automation Conference (DAC), pp. 1-6
Main Authors: Lili Song, Ying Wang, Yinhe Han, Xin Zhao, Bosheng Liu, Xiaowei Li
Format: Conference Proceeding
Language: English
Published: IEEE 05-06-2016
Summary: Convolutional neural network (CNN) accelerators have been proposed as an efficient hardware solution for deep-learning-based applications, which are known to be both compute- and memory-intensive. Although the most advanced CNN accelerators can deliver high computational throughput, their performance is highly unstable: once the workload changes to a new network with different parameters, such as layer count and kernel size, the fixed hardware structure may no longer match the data flows well. Consequently, the accelerator fails to deliver high performance because either its logic resources or its memory bandwidth is underutilized. To overcome this problem, we propose a novel deep learning accelerator that offers three types of data-level parallelism: inter-kernel, intra-kernel, and hybrid. Our design adaptively switches among the three types of parallelism and the corresponding data tiling schemes to dynamically match different networks, or even different layers of a single network. Regardless of hardware configuration or network type, the proposed network mapping strategy ensures optimal performance and energy efficiency. Compared with previous state-of-the-art NN accelerators, the design achieves a speedup of 4.0x-8.3x on some layers of well-known large-scale CNNs. Over the whole network forward-propagation phase, it achieves on average 28.04% PE energy savings and 90.3% on-chip memory energy savings.
DOI:10.1145/2897937.2897995
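
The abstract names three data-level parallelization schemes but does not spell out a loop nest. Below is a minimal, hypothetical Python sketch (our illustration, not the paper's dataflow or hardware design) of a standard convolution layer, with comments marking which loops a spatial accelerator could unroll across its PEs under each scheme; the function conv_layer and all names in it are assumptions introduced here.

import numpy as np

def conv_layer(x, w):
    """x: input of shape (Cin, H, W); w: kernels of shape (Cout, Cin, K, K).
    Returns output feature maps of shape (Cout, H-K+1, W-K+1), stride 1, no padding."""
    Cout, Cin, K, _ = w.shape
    _, H, W = x.shape
    out = np.zeros((Cout, H - K + 1, W - K + 1))
    for m in range(Cout):               # inter-kernel: each PE computes a
        for i in range(H - K + 1):      # different output map m in parallel
            for j in range(W - K + 1):
                acc = 0.0
                for c in range(Cin):        # intra-kernel: PEs share one
                    for p in range(K):      # kernel's MACs (loops c, p, q)
                        for q in range(K):  # or output pixels (loops i, j)
                            acc += x[c, i + p, j + q] * w[m, c, p, q]
                out[m, i, j] = acc          # hybrid: split PEs across both
    return out

Under a fixed inter-kernel mapping, a layer with fewer output maps than PEs leaves hardware idle; under a fixed intra-kernel mapping, small kernels do the same. Switching the scheme, and the matching data tiling, per layer, as the abstract describes, is what keeps the PE array and memory bandwidth utilized across diverse networks.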