Fast and Scalable Multicore YOLOv3-Tiny Accelerator Using Input Stationary Systolic Architecture


Bibliographic Details
Published in: IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 31, No. 11, pp. 1-14
Main Authors: Adiono, Trio; Ramadhan, Rhesa Muhammad; Sutisna, Nana; Syafalni, Infall; Mulyawan, Rahmat; Lin, Chang-Hong
Format: Journal Article
Language: English
Published: New York: IEEE, 01-11-2023
The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Description
Summary: This article proposes a scalable accelerator for deep learning (DL) implementation on edge computing, which is often limited by power, storage, and computation speed. The accelerator is based on systolic array cores with 126 processing elements (PEs) and is optimized for YOLOv3-Tiny with 448 × 448 input images. Two multicast (MC) network architectures, feature map multicasting and weight multicasting, are introduced to control data-stream distribution within the multicores. Results show that the proposed weight multicast (W-MC) systems outperform the feature map multicast (FMAP-MC) systems in multicore scenarios, delivering up to 2.23× higher frame rates (frames per second, FPS). The 4-core W-MC system achieved the best efficiency, with an overall frame-rate efficiency of 13.73 FPS/W and an overall throughput efficiency of 35.83 GOPS/W. The 8-core W-MC system delivered the best performance, with a frame rate of 38.50 FPS after normalization to the standard YOLOv3-Tiny network. Compared with previous state-of-the-art works, the proposed accelerator offers better computational efficiency and greater accelerator utilization in real-world inference scenarios.
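The abstract refers to an input-stationary systolic dataflow, in which each PE pins one input activation in a local register while weight values stream past it and partial sums accumulate. A minimal behavioral sketch of that dataflow (illustrative only — the function name is hypothetical and the paper's 126-PE core, multicast networks, and YOLOv3-Tiny specifics are not modeled):

```python
def input_stationary_matmul(x, w):
    """Toy model of an input-stationary systolic dataflow.

    Each PE (i, kk) holds input activation x[i][kk] stationary in a
    local register; weight values w[kk][j] stream through the array,
    and partial sums for each output accumulate along the way.
    Illustrative sketch only -- not the paper's RTL design.
    """
    n, k, m = len(x), len(w), len(w[0])
    out = [[0.0] * m for _ in range(n)]
    for j in range(m):              # weight columns stream through the array
        for i in range(n):
            for kk in range(k):     # PE (i, kk) reuses its pinned input
                out[i][j] += x[i][kk] * w[kk][j]
    return out

# Functionally this is an ordinary matrix multiply; the point of the
# dataflow is that each input is loaded once and reused for every weight.
result = input_stationary_matmul([[1.0, 2.0], [3.0, 4.0]],
                                 [[5.0, 6.0], [7.0, 8.0]])
```

Keeping inputs stationary maximizes reuse of each activation across many weights, which reduces on-chip data movement — the usual motivation for choosing a stationarity scheme in a systolic accelerator.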
ISSN: 1063-8210
1557-9999
DOI: 10.1109/TVLSI.2023.3305937