Balancing Locality and Concurrency: Solving Sparse Triangular Systems on GPUs

Bibliographic Details
Published in:	2016 IEEE 23rd International Conference on High Performance Computing (HiPC), pp. 183-192
Main Authors:	Picciau, Andrea; Inggs, Gordon E.; Wickerson, John; Kerrigan, Eric C.; Constantinides, George A.
Format:	Conference Proceeding
Language:	English
Published:	IEEE 01-12-2016
Subjects:	Algorithm design and analysis; concurrency; Concurrent computing; Context; CUSPARSE; data locality; Data structures; GPU; Graphics processing units; linear algebra; OpenCL; Partitioning algorithms; sparse; Sparse matrices; systems of equations

Description
Summary:	Many numerical optimisation problems rely on fast algorithms for solving sparse triangular systems of linear equations (STLs). To accelerate the solution of such equations, two types of approaches have been used: on GPUs, concurrency has been prioritised to the disadvantage of data locality, while on multi-core CPUs, data locality has been prioritised to the disadvantage of concurrency. In this paper, we discuss the interaction between data locality and concurrency in the solution of STLs on GPUs, and we present a new algorithm that balances both. We demonstrate empirically that, subject to there being enough concurrency available in the input matrix, our algorithm outperforms Nvidia's concurrency-prioritising CUSPARSE algorithm for GPUs. Experimental results show a maximum speedup of 5.8-fold. Our solution algorithm, which we have implemented in OpenCL, requires a pre-processing phase that partitions the graph associated with the input matrix into sub-graphs, whose data can be stored in low-latency local memories. This preliminary analysis phase is expensive, but because it depends only on the input matrix, its cost can be amortised when solving for many different right-hand sides.
DOI:	10.1109/HiPC.2016.030
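
Illustrative sketch:	The summary notes that the analysis phase depends only on the input matrix, so its cost can be amortised over many right-hand sides. The Python sketch below illustrates that idea with level scheduling, a common way to expose concurrency in a sparse triangular solve. It is not the partitioning algorithm of the paper, it does not model local memory or data locality, and every name in it (analyse_levels, solve_lower, the example matrix) is hypothetical. The matrix is assumed to be lower triangular, stored in CSR form, with a non-zero diagonal.

```python
# Illustrative sketch only: level scheduling for a sparse lower-triangular solve.
# NOT the paper's partitioning algorithm; all names here are hypothetical.

def analyse_levels(indptr, indices):
    """Analysis phase: uses only the sparsity pattern of L (CSR, lower triangular)."""
    n = len(indptr) - 1
    level = [0] * n
    for i in range(n):
        for k in range(indptr[i], indptr[i + 1]):
            j = indices[k]
            if j < i:  # off-diagonal entry: row i needs x[j] first
                level[i] = max(level[i], level[j] + 1)
    levels = [[] for _ in range(max(level, default=-1) + 1)]
    for i, lv in enumerate(level):
        levels[lv].append(i)
    return levels


def solve_lower(indptr, indices, data, levels, b):
    """Solve phase: solves L x = b, reusing the levels from analyse_levels."""
    x = [0.0] * len(b)
    for rows in levels:      # levels must be processed in order
        for i in rows:       # rows within one level are mutually independent;
            acc = b[i]       # a GPU could process them concurrently
            diag = None
            for k in range(indptr[i], indptr[i + 1]):
                j = indices[k]
                if j == i:
                    diag = data[k]
                else:
                    acc -= data[k] * x[j]
            x[i] = acc / diag
    return x


# Example matrix (assumed for illustration):
# [[2, 0, 0, 0],
#  [0, 4, 0, 0],
#  [1, 0, 1, 0],
#  [0, 2, 1, 5]]
indptr = [0, 1, 2, 4, 7]
indices = [0, 1, 0, 2, 1, 2, 3]
data = [2, 4, 1, 1, 2, 1, 5]

levels = analyse_levels(indptr, indices)  # computed once per matrix
x1 = solve_lower(indptr, indices, data, levels, [2, 4, 3, 9])
x2 = solve_lower(indptr, indices, data, levels, [4, 8, 2, 5])  # same levels reused
```

Here analyse_levels runs once and both solves reuse its result, which is the amortisation pattern the summary describes. Rows 0 and 1 of the example have no dependencies and share a level, while rows 2 and 3 must each wait for an earlier level.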