Accelerating Communication for Parallel Programming Models on GPU Systems
Main Authors:
Format: Journal Article
Language: English
Published: 24-02-2021
Summary: As an increasing number of leadership-class systems embrace GPU accelerators in the race towards exascale, efficient communication of GPU data is becoming one of the most critical components of high-performance computing. For developers of parallel programming models, implementing support for GPU-aware communication using native GPU APIs such as CUDA can be a daunting task, as it requires considerable effort with little guarantee of performance. In this work, we demonstrate the capability of the Unified Communication X (UCX) framework to compose a GPU-aware communication layer that serves multiple parallel programming models of the Charm++ ecosystem: Charm++, Adaptive MPI (AMPI), and Charm4py. We evaluate the performance impact of our designs with microbenchmarks adapted from the OSU benchmark suite, obtaining latency improvements of up to 10.1x in Charm++, 11.7x in AMPI, and 17.4x in Charm4py. We also observe bandwidth increases of up to 10.1x in Charm++, 10x in AMPI, and 10.5x in Charm4py. We show the potential impact of our designs on real-world applications by evaluating a proxy application for the Jacobi iterative method, improving communication performance by up to 12.4x in Charm++, 12.8x in AMPI, and 19.7x in Charm4py.
DOI: 10.48550/arxiv.2102.12416
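
To illustrate the kind of GPU-aware communication the summary describes, below is a minimal sketch of a Jacobi-style halo exchange in CUDA C++, written against the standard MPI interface that AMPI also implements. It assumes a CUDA-aware MPI (or AMPI-over-UCX) build that accepts device pointers directly in communication calls; the array size, rank layout, tags, and the omitted update kernel are illustrative placeholders, not code from the paper.

// Illustrative sketch (not from the paper): 1D Jacobi halo exchange with
// GPU-aware communication. Device pointers are passed straight into the
// communication calls instead of being staged through host buffers.
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    const int n = 1 << 20;           // interior points per rank (assumed size)
    double *u;                       // device array with one halo cell per side
    cudaMalloc(&u, (n + 2) * sizeof(double));
    cudaMemset(u, 0, (n + 2) * sizeof(double));

    int left  = (rank > 0)          ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < nranks - 1) ? rank + 1 : MPI_PROC_NULL;

    // Exchange halos: send u[1] left and u[n] right, receive into u[0]
    // and u[n+1]. With a GPU-aware transport (e.g. UCX underneath), these
    // device pointers avoid explicit cudaMemcpy to host staging buffers.
    MPI_Request reqs[4];
    MPI_Irecv(u,         1, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(u + n + 1, 1, MPI_DOUBLE, right, 1, MPI_COMM_WORLD, &reqs[1]);
    MPI_Isend(u + 1,     1, MPI_DOUBLE, left,  1, MPI_COMM_WORLD, &reqs[2]);
    MPI_Isend(u + n,     1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[3]);
    MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);

    // ... launch the Jacobi update kernel on the device here ...

    cudaFree(u);
    MPI_Finalize();
    return 0;
}

The point of GPU awareness is that u is a device allocation: a GPU-aware layer can move it with GPUDirect-style transfers rather than staging through host memory, which is the general mechanism behind the latency and bandwidth gains the summary reports.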