MLPerf HPC: A Holistic Benchmark Suite for Scientific Machine Learning on HPC Systems
Format: Journal Article
Language: English
Published: 21-10-2021
Summary: Scientific communities are increasingly adopting machine learning and deep learning models in their applications to accelerate scientific insights. High performance computing systems are pushing the frontiers of performance with a rich diversity of hardware resources and massive scale-out capabilities. There is a critical need to understand fair and effective benchmarking of machine learning applications that are representative of real-world scientific use cases. MLPerf is a community-driven standard to benchmark machine learning workloads, focusing on end-to-end performance metrics. In this paper, we introduce MLPerf HPC, a benchmark suite of large-scale scientific machine learning training applications driven by the MLCommons Association. We present the results from the first submission round, including a diverse set of some of the world's largest HPC systems. We develop a systematic framework for their joint analysis and compare them in terms of data staging, algorithmic convergence, and compute performance. As a result, we gain a quantitative understanding of optimizations on different subsystems such as staging and on-node loading of data, compute-unit utilization, and communication scheduling, enabling overall $>10 \times$ (end-to-end) performance improvements through system scaling. Notably, our analysis shows a scale-dependent interplay between the dataset size, a system's memory hierarchy, and training convergence that underlines the importance of near-compute storage. To overcome the data-parallel scalability challenge at large batch sizes, we discuss specific learning techniques and hybrid data-and-model parallelism that are effective on large systems. We conclude by characterizing each benchmark with respect to low-level memory, I/O, and network behavior to parameterize extended roofline performance models in future rounds.
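The summary mentions hybrid data-and-model parallelism as a way past the large-batch limits of pure data parallelism. As an illustrative sketch only (not the paper's implementation, and independent of any particular framework), the following shows the common rank-factoring idea behind such hybrid schemes: the global ranks are arranged into a two-dimensional grid, where each model-parallel group shards one model replica and the ranks holding matching shards across replicas form a data-parallel group for gradient averaging.

```python
# Hypothetical sketch: factor global ranks into a 2D grid for hybrid
# data-and-model parallelism. Group names and sizes are illustrative.

def build_rank_groups(world_size: int, model_parallel_size: int):
    """Return (model_parallel_groups, data_parallel_groups) as lists of rank lists."""
    assert world_size % model_parallel_size == 0, "world size must divide evenly"
    data_parallel_size = world_size // model_parallel_size

    # Consecutive ranks share one model replica (model-parallel group).
    model_groups = [
        list(range(r * model_parallel_size, (r + 1) * model_parallel_size))
        for r in range(data_parallel_size)
    ]
    # Ranks holding the same shard across replicas form a data-parallel group.
    data_groups = [
        list(range(offset, world_size, model_parallel_size))
        for offset in range(model_parallel_size)
    ]
    return model_groups, data_groups


if __name__ == "__main__":
    # Example: 8 ranks, each model replica sharded across 2 ranks.
    model_groups, data_groups = build_rank_groups(world_size=8, model_parallel_size=2)
    print("model-parallel groups:", model_groups)  # [[0, 1], [2, 3], [4, 5], [6, 7]]
    print("data-parallel groups: ", data_groups)   # [[0, 2, 4, 6], [1, 3, 5, 7]]
```

Keeping model-parallel ranks adjacent is a common choice because it places the most communication-intensive (shard-to-shard) traffic on the fastest links within a node, while the less frequent gradient all-reduce spans nodes.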
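The summary also refers to extended roofline performance models. For context, a minimal sketch of the classical roofline bound that such extensions build on (the paper's extended models additionally account for I/O and network behavior, which this simple form does not capture):

$$P_{\text{attainable}} = \min\left(P_{\text{peak}},\; I \cdot B_{\text{peak}}\right), \qquad I = \frac{\text{floating-point operations}}{\text{bytes moved}}$$

Here $P_{\text{peak}}$ is the peak compute throughput, $B_{\text{peak}}$ the peak memory bandwidth, and $I$ the arithmetic intensity of the workload; performance is bandwidth-bound below the ridge point $I = P_{\text{peak}} / B_{\text{peak}}$ and compute-bound above it.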
DOI: 10.48550/arxiv.2110.11466