The design of ultra scalable MPI collective communication on the K computer

Bibliographic Details
Published in:	Computer science (Berlin, Germany) Vol. 28; no. 2-3; pp. 147 - 155
Main Authors:	Adachi, Tomoya, Shida, Naoyuki, Miura, Kenichi, Sumimoto, Shinji, Uno, Atsuya, Kurokawa, Motoyoshi, Shoji, Fumiyoshi, Yokokawa, Mitsuo
Format:	Journal Article
Language:	English
Published:	Berlin/Heidelberg Springer-Verlag 01-05-2013
Subjects:	Computer Hardware; Computer Science; Computer Systems Organization and Communication Networks; Data Structures and Information Theory; Software Engineering/Programming and Operating Systems; Special Issue Paper; Theory of Computation; Torus network; MPI collective communication; K computer

Description
Summary:	This paper proposes the design of ultra scalable MPI collective communication for the K computer, which consists of 82,944 computing nodes and is the world’s first system to exceed 10 PFLOPS. The nodes are connected by the Tofu interconnect, which has a six-dimensional mesh/torus topology. Existing MPI libraries, however, perform poorly on such a direct network system because they assume typical cluster environments. We therefore design collective algorithms optimized for the K computer. In designing the algorithms, we prioritize collision-freeness for long messages and low latency for short messages. The long-message algorithms use multiple RDMA network interfaces and consist of neighbor communication in order to achieve high bandwidth and avoid message collisions. The short-message algorithms, on the other hand, are designed to reduce software overhead, which stems from the number of relaying nodes. Evaluation results on up to 55,296 nodes of the K computer show that the new implementation outperforms the existing one for long messages by a factor of 4 to 11. They also show that the short-message algorithms complement the long-message ones.
ISSN:	1865-2034 1865-2042
DOI:	10.1007/s00450-012-0211-7
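
Note:	The summary's point about long-message algorithms built from neighbor communication can be illustrated with a generic ring allgather. The sketch below is not the paper's algorithm and does not model the Tofu interconnect or multiple RDMA interfaces; it assumes a one-dimensional torus (a ring), and the function name `ring_allgather` and the link-load bookkeeping are hypothetical, introduced only for illustration. Each rank sends only to its right-hand neighbor, so in every step each directed link carries exactly one message.

```python
def ring_allgather(blocks):
    """Simulate an allgather on a 1-D torus (ring) using neighbor sends only.

    blocks[r] is the data initially owned by rank r.
    Returns (result, link_loads): result[r] is the list of all blocks as seen
    by rank r, and link_loads[s][r] is the number of messages placed on the
    directed link r -> (r + 1) % p during step s.
    """
    p = len(blocks)
    # gathered[r] maps originating rank -> block currently held by rank r.
    gathered = [{r: blocks[r]} for r in range(p)]
    link_loads = []

    for step in range(p - 1):
        loads = [0] * p
        sends = []
        for r in range(p):
            # Rank r forwards the block that originated `step` hops to its left.
            src = (r - step) % p
            sends.append(((r + 1) % p, src, gathered[r][src]))
            loads[r] += 1  # one message on link r -> r + 1
        # Deliver after all sends are decided, as in a synchronous step.
        for dst, src, data in sends:
            gathered[dst][src] = data
        link_loads.append(loads)

    result = [[g[s] for s in range(p)] for g in gathered]
    return result, link_loads


if __name__ == "__main__":
    result, link_loads = ring_allgather(["a", "b", "c", "d"])
    print(result[0])
    print(link_loads)
```

The allgather finishes in p - 1 steps for p ranks, and no link is ever shared by two messages in the same step. That is the collision-free property in its simplest form. A real implementation on a multi-dimensional torus would have to preserve this property across all dimensions at once.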