Vectorizing divergent control flow with active-lane consolidation on long-vector architectures

Bibliographic Details
Published in:	The Journal of Supercomputing, Vol. 78, no. 10, pp. 12553-12588
Main Authors:	Praharenka, Wyatt; Pankratz, David; De Carvalho, João P. L.; Amiri, Ehsan; Amaral, José Nelson
Format:	Journal Article
Language:	English
Published:	New York: Springer US; Springer Nature B.V., 01-07-2022
Subjects:	Active control; Compilers; Computer Science; Consolidation; Divergence; Interpreters; Optimization; Permutations; Processor Architectures; Programming Languages; Transformations; Code generation; Scalable vector extension; Control-flow divergence; Vectorization; Instruction-set architecture design

Description
Summary:	Control-flow divergence limits the applicability of loop vectorization, an important code transformation that accelerates data-parallel loops. Control-flow divergence is commonly handled using an IF-conversion transformation combined with vector predication. However, the resulting vector instructions execute inefficiently with many inactive lanes. Branch-on-superword-condition-code (BOSCC) instructions are used to skip over some vector instructions, but their effectiveness decreases as vector length increases. This paper presents a novel vector permutation, active-lane consolidation (ALC), that enables efficient execution of control-divergent loops by consolidating the active lanes of two vectors. This paper demonstrates the use of ALC with two loop transformations and applies them to kernels extracted from the SPEC CPU 2017 benchmark suite, achieving up to a 30.9% reduction in dynamic instruction count compared to optimization using only BOSCCs. Motivated by ALC, this paper also proposes design changes to the ARM Scalable Vector Extension (SVE) to improve vectorization of control-divergent loops.
ISSN:	0920-8542; 1573-0484
DOI:	10.1007/s11227-022-04359-w
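
The summary describes ALC only as a permutation that consolidates the active lanes of two vectors. As a rough illustration of that idea, the following is a minimal scalar sketch in Python, not the paper's definition or an SVE instruction sequence. The function name `consolidate`, the `fill` value, and the choice to pack active lanes in order into the first output vector are all assumptions made for this example; the lanes are imagined to hold loop-iteration indices whose predicate is true.

```python
from typing import List, Tuple

Vec = List[int]
Mask = List[bool]


def consolidate(a: Vec, mask_a: Mask, b: Vec, mask_b: Mask,
                fill: int = 0) -> Tuple[Vec, Mask, Vec, Mask]:
    """Scalar model (hypothetical) of consolidating the active lanes of two vectors.

    Active lanes of `a`, then of `b`, are packed into the first output
    vector; any overflow goes to the second. Inactive lanes hold `fill`.
    """
    vl = len(a)  # modelled vector length, in lanes
    if not (len(mask_a) == len(b) == len(mask_b) == vl):
        raise ValueError("all vectors and masks must have the same length")

    # Collect the values in active lanes, preserving lane order.
    active = [x for x, m in zip(a, mask_a) if m]
    active += [x for x, m in zip(b, mask_b) if m]

    def pad(xs: Vec) -> Tuple[Vec, Mask]:
        # Active values first, then inactive filler lanes.
        n = len(xs)
        return xs + [fill] * (vl - n), [True] * n + [False] * (vl - n)

    first_vec, first_mask = pad(active[:vl])
    rest_vec, rest_mask = pad(active[vl:])
    return first_vec, first_mask, rest_vec, rest_mask


# Two half-active index vectors become one fully active vector plus a remainder.
full, full_mask, rest, rest_mask = consolidate(
    [0, 1, 2, 3], [True, False, True, False],
    [4, 5, 6, 7], [False, True, True, True],
)
```

In this model, a fully active first vector could execute the conditional code without wasted lanes, while the remainder is carried forward to be merged with the next vector. How the paper's two loop transformations actually use the permutation is not stated in this record.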