RaPiD: AI Accelerator for Ultra-low Precision Training and Inference

Bibliographic Details
Published in: 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA), pp. 153-166
Main Authors: Venkataramani, Swagath, Srinivasan, Vijayalakshmi, Wang, Wei, Sen, Sanchari, Zhang, Jintao, Agrawal, Ankur, Kar, Monodeep, Jain, Shubham, Mannari, Alberto, Tran, Hoang, Li, Yulong, Ogawa, Eri, Ishizaki, Kazuaki, Inoue, Hiroshi, Schaal, Marcel, Serrano, Mauricio, Choi, Jungwook, Sun, Xiao, Wang, Naigang, Chen, Chia-Yu, Allain, Allison, Bonano, James, Cao, Nianzheng, Casatuta, Robert, Cohen, Matthew, Fleischer, Bruce, Guillorn, Michael, Haynie, Howard, Jung, Jinwook, Kang, Mingu, Kim, Kyu-hyoun, Koswatta, Siyu, Lee, Saekyu, Lutz, Martin, Mueller, Silvia, Oh, Jinwook, Ranjan, Ashish, Ren, Zhibin, Rider, Scot, Schelm, Kerstin, Scheuermann, Michael, Silberman, Joel, Yang, Jie, Zalani, Vidhi, Zhang, Xin, Zhou, Ching, Ziegler, Matt, Shah, Vinay, Ohara, Moriyoshi, Lu, Pong-Fei, Curran, Brian, Shukla, Sunil, Chang, Leland, Gopalakrishnan, Kailash
Format: Conference Proceeding
Language: English
Published: IEEE 01-06-2021
Description
Summary: The growing prevalence and computational demands of Artificial Intelligence (AI) workloads have led to the widespread use of hardware accelerators in their execution. Scaling the performance of AI accelerators across generations is pivotal to their success in commercial deployments. The intrinsic error-resilient nature of AI workloads presents a unique opportunity for performance/energy improvement through precision scaling. Motivated by recent algorithmic advances in precision scaling for inference and training, we designed RaPiD, a 4-core AI accelerator chip supporting a spectrum of precisions, namely 16- and 8-bit floating-point and 4- and 2-bit fixed-point. The 36 mm² RaPiD chip, fabricated in 7nm EUV technology, delivers a peak 3.5 TFLOPS/W in HFP8 mode and 16.5 TOPS/W in INT4 mode at nominal voltage. Using a performance model calibrated to within 1% of the measurement results, we evaluated DNN inference using 4-bit fixed-point representation for a 4-core RaPiD chip system and DNN training using 8-bit floating-point representation for a 768-TFLOPS AI system comprising four 32-core RaPiD chips. Our results show that INT4 inference at a batch size of 1 achieves 3-13.5 (average 7) TOPS/W, and FP8 training with a mini-batch of 512 achieves a sustained 102-588 (average 203) TFLOPS across a wide range of applications.
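To make the INT4 mode mentioned in the summary concrete, the sketch below shows symmetric 4-bit fixed-point quantization of the general kind that precision-scaled inference relies on. This is an illustrative example only, not the paper's actual quantization scheme: the per-tensor scale choice, rounding mode, and function names here are all assumptions.

```python
# Illustrative sketch (not from the paper): symmetric INT4 quantization.
# An INT4 value spans the signed range [-8, 7].

def quantize_int4(values):
    """Map floats to INT4 codes in [-8, 7] using a per-tensor scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 7.0 if max_abs > 0 else 1.0  # 7 = largest positive INT4 code
    codes = [max(-8, min(7, round(v / scale))) for v in values]
    return codes, scale

def dequantize_int4(codes, scale):
    """Recover approximate floats from INT4 codes."""
    return [c * scale for c in codes]

weights = [0.91, -0.42, 0.07, -1.3]
codes, scale = quantize_int4(weights)   # codes fit in 4 bits each
approx = dequantize_int4(codes, scale)  # lossy reconstruction
```

Packing each weight into 4 bits instead of 16 or 32 is what enables the large TOPS/W gains the abstract reports, at the cost of the reconstruction error visible in `approx`; error-resilient DNN workloads tolerate that loss.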
ISSN:2575-713X
DOI:10.1109/ISCA52012.2021.00021