Multi-GPU implementation of a time-explicit finite volume solver using CUDA and a CUDA-Aware version of OpenMPI with application to shallow water flows
Published in: Computer Physics Communications, Vol. 271, p. 108190
Main Authors: ,
Format: Journal Article
Language: English
Published: Elsevier B.V., 01-02-2022
Summary: This paper presents the development of a multi-GPU version of a time-explicit finite volume solver for the Shallow-Water Equations (SWE). MPI is combined with CUDA-Fortran in order to use as many GPUs as needed, and the METIS library is leveraged to perform a domain decomposition on the 2D unstructured triangular meshes of interest. A CUDA-Aware version of OpenMPI is adopted to speed up message passing between the MPI processes. A study of both speed-up and efficiency is conducted: first for a classic dam-break flow in a canal, and then for two real domains with complex bathymetries. In both cases, meshes with up to 12 million cells are used. Using 24 to 28 GPUs on these meshes leads to efficiencies of 80% or more. Finally, the multi-GPU version is compared to the pure MPI multi-CPU version, and it is concluded that, in this particular case, about 100 CPU cores would be needed to achieve the same performance as one GPU. The developed methodology is applicable to general time-explicit Riemann solvers for conservation laws.
Highlights:
• Multi-GPU version of a finite volume solver for the Shallow-Water Equations using CUDA and a CUDA-Aware version of OpenMPI.
• Domain decomposition of 2D unstructured meshes using METIS with a specific renumbering for efficient memory exchange.
• Achievement of a 21x speed-up when using 32 GPUs compared to utilizing a single GPU.
• Comparison of the Multi-GPU and Multi-CPU versions of our in-house code shows that 8 GPUs perform as well as 1024 CPU cores.
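The renumbering mentioned in the highlights can be illustrated with a minimal sketch (hypothetical code, not taken from the paper): once a METIS-style partitioning assigns each cell to a subdomain, cells that touch an inter-partition boundary are grouped contiguously at the end of each subdomain's numbering, so that halo data can be packed into, and exchanged as, a single contiguous buffer between GPUs.

```python
def renumber_for_exchange(part, adjacency):
    """Illustrative renumbering sketch (assumed scheme, not the authors' exact one).

    part[c]      -> partition id of cell c
    adjacency[c] -> list of neighbor cells of cell c

    Returns, for each partition, a cell ordering with interior cells
    first and inter-partition boundary cells grouped contiguously at
    the end, so halo values occupy one contiguous memory range.
    """
    buckets = {}
    for c, p in enumerate(part):
        interior, boundary = buckets.setdefault(p, ([], []))
        # A cell is a boundary cell if any neighbor lies in another partition.
        if any(part[n] != p for n in adjacency[c]):
            boundary.append(c)
        else:
            interior.append(c)
    return {p: interior + boundary for p, (interior, boundary) in buckets.items()}


# Toy example: a 1-D chain of 6 cells split into two partitions of 3.
part = [0, 0, 0, 1, 1, 1]
adj = [[1], [0, 2], [1, 3], [2, 4], [3, 5], [4]]
print(renumber_for_exchange(part, adj))
# -> {0: [0, 1, 2], 1: [4, 5, 3]}  (interface cells 2 and 3 come last)
```

With this ordering, each MPI rank can send its boundary cells with one contiguous device-to-device message (which CUDA-Aware OpenMPI can route without staging through host memory), instead of gathering scattered cells before every exchange.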
ISSN: 0010-4655; 1879-2944
DOI: 10.1016/j.cpc.2021.108190