Tacker: Tensor-CUDA Core Kernel Fusion for Improving the GPU Utilization while Ensuring QoS

Bibliographic Details
Published in: 2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pp. 800-813
Main Authors: Zhao, Han; Cui, Weihao; Chen, Quan; Zhang, Youtao; Lu, Yanchao; Li, Chao; Leng, Jingwen; Guo, Minyi
Format: Conference Proceeding
Language: English
Published: IEEE, 01-04-2022
Summary: The proliferation of machine learning applications has driven GPUs to integrate both CUDA Cores and Tensor Cores to meet their acceleration demands. While studies have shown that co-locating multiple tasks on the same GPU can effectively improve system throughput and resource utilization, existing schemes focus on scheduling the resources of traditional CUDA Cores and thus lack the ability to exploit the parallelism between Tensor Cores and CUDA Cores. In this paper, we propose Tacker, a static kernel fusion and scheduling approach that improves GPU utilization of both types of cores while ensuring the QoS (Quality-of-Service) of co-located tasks. Tacker consists of a Tensor-CUDA Core kernel fuser, a duration predictor for fused kernels, and a runtime QoS-aware kernel manager. The kernel fuser enables the flexible fusion of kernels that use Tensor Cores and CUDA Cores, respectively. The duration predictor accurately predicts the duration of the fused kernels. Finally, the kernel manager invokes either the fused kernel or the original kernels, based on the QoS headroom of latency-critical tasks, to improve system throughput. Our experimental results show that Tacker improves the throughput of best-effort applications by 18.6% on average over state-of-the-art solutions, while ensuring the QoS of latency-critical tasks. (An illustrative sketch of the kernel-fusion idea follows the record details below.)
ISSN: 2378-203X
DOI: 10.1109/HPCA53966.2022.00064
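
The summary describes fusing a kernel that uses Tensor Cores with a kernel that uses CUDA Cores into a single kernel so that both core types are busy concurrently. As a rough illustration of that general idea (this is not Tacker's actual fuser output; all names here, such as gemm_wmma_part, vec_add_part, and tc_blocks, are hypothetical), a fused CUDA kernel can statically partition its thread blocks between the two original kernel bodies:

    // Illustrative sketch only: static Tensor-CUDA Core kernel fusion.
    // Tacker's real fuser also manages register/shared-memory budgets and
    // QoS-aware scheduling, which this sketch omits. Requires sm_70+.
    #include <mma.h>
    #include <cuda_fp16.h>
    using namespace nvcuda;

    // Tensor Core body: one warp computes a 16x16x16 GEMM tile via WMMA.
    __device__ void gemm_wmma_part(const half* a, const half* b, float* c) {
        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> fa;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> fb;
        wmma::fragment<wmma::accumulator, 16, 16, 16, float> fc;
        wmma::fill_fragment(fc, 0.0f);
        wmma::load_matrix_sync(fa, a, 16);
        wmma::load_matrix_sync(fb, b, 16);
        wmma::mma_sync(fc, fa, fb, fc);
        wmma::store_matrix_sync(c, fc, 16, wmma::mem_row_major);
    }

    // CUDA Core body: elementwise add, grid-strided over the CC blocks only.
    __device__ void vec_add_part(const float* x, const float* y, float* z,
                                 int n, int block, int nblocks) {
        for (int i = block * blockDim.x + threadIdx.x; i < n;
             i += nblocks * blockDim.x)
            z[i] = x[i] + y[i];
    }

    // Fused kernel: the first tc_blocks thread blocks drive the Tensor
    // Cores (one batched GEMM tile each); the remaining blocks drive the
    // CUDA Cores, so both core types are utilized within one launch.
    __global__ void fused_kernel(const half* a, const half* b, float* c,
                                 const float* x, const float* y, float* z,
                                 int n, int tc_blocks) {
        if (blockIdx.x < tc_blocks) {
            int off = blockIdx.x * 16 * 16;  // per-block 16x16 tile offset
            gemm_wmma_part(a + off, b + off, c + off);
        } else {
            vec_add_part(x, y, z, n,
                         blockIdx.x - tc_blocks, gridDim.x - tc_blocks);
        }
    }
    // Example launch with one warp per block:
    //   fused_kernel<<<tc_blocks + cc_blocks, 32>>>(a, b, c, x, y, z, n, tc_blocks);

A runtime manager in the spirit of the abstract would then choose, per invocation, between launching the fused kernel and launching the two original kernels separately, depending on how much QoS headroom the latency-critical task currently has.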