ReBEC: A replacement-based energy-efficient fault-tolerance design for associative caches


Bibliographic Details
Published in: Future Generation Computer Systems, Vol. 155, pp. 39-52
Main Authors: Gao, Xin; Cui, Naiyuan; Nian, Jiawei; Liang, Zongnan; Gao, Jiaxuan; Liu, Hongjin; Yang, Mengfei
Format: Journal Article
Language: English
Published: Elsevier B.V., 01-06-2024
Description
Summary:
• Several cache blocks in lower-level caches are infrequently accessed, which indicates room for optimization.
• A dynamic error detection and correction coding scheme is designed for energy efficiency.
• A separate extra storage area for strong check codes is organized and coupled with the traditional cache.
• The dynamic check scheme significantly reduces energy consumption overhead without performance loss.

Severe environments such as space radiation can induce soft errors in processors, causing unexpected bit-flips. Error Detection and Correction (EDAC) is a crucial method for protecting the on-chip cache hierarchy against soft errors. However, conventional schemes in modern processors almost always use fixed EDAC protection designs and do not exploit the dynamic access behavior of cache blocks at runtime. In these schemes, a parity or Hamming code is applied in a predefined way that does not change across read/write accesses to the cache, which may incur unnecessary energy overhead. We observe that several cache blocks, especially in lower-level caches, are accessed only occasionally or not at all, and therefore do not require strong protection. In this paper, we propose a configurable dynamic fault-tolerant cache design, called Replacement-Based EDAC Cache (ReBEC), to improve energy efficiency in modern reliable cache hierarchies. We divide the error protection space into three levels and leverage the access counters in replacement policies such as LRU and its derivatives to adaptively adjust the protection level of each cache block. To reduce energy consumption, newly inserted cache blocks are initially protected at the weak level and promoted adaptively when their access priority is elevated. The evaluation results show that our proposal outperforms the traditional schemes on the SPEC CPU benchmark suite while achieving comparable fault-tolerance capability.
ReBEC is capable of reducing the dynamic energy consumption overhead of check schemes by up to 43.5% and achieving an average reduction of 23.6% compared to the fixed protection design. Moreover, our proposal is orthogonal to previous EDAC schemes and can be reconfigured to further enhance the fault-tolerance capability.
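
The promotion idea described in the summary can be illustrated with a small software model. This is only a sketch under stated assumptions, not the authors' hardware design: the level names (WEAK, MEDIUM, STRONG), the class CacheSet, and the rule "promote by one level on each hit" are all hypothetical, since the summary only says that there are three protection levels, that new blocks start at the weak level, and that blocks are promoted as their access priority rises under an LRU-style policy.

```python
from collections import OrderedDict
from enum import IntEnum


class Protection(IntEnum):
    # Hypothetical names; the summary only states that there are three levels.
    WEAK = 0
    MEDIUM = 1
    STRONG = 2


class CacheSet:
    """Toy model of one set in an associative cache with true-LRU ordering."""

    def __init__(self, ways=4):
        self.ways = ways
        # tag -> protection level, ordered from least to most recently used.
        self.blocks = OrderedDict()

    def access(self, tag):
        """Return True on a hit, False on a miss (the block is then filled)."""
        if tag in self.blocks:
            # Assumed rule: each hit raises the level by one, capped at STRONG.
            level = self.blocks[tag]
            self.blocks[tag] = Protection(min(level + 1, Protection.STRONG))
            self.blocks.move_to_end(tag)  # mark as most recently used
            return True
        if len(self.blocks) >= self.ways:
            self.blocks.popitem(last=False)  # evict the least recently used block
        # Newly inserted blocks start at the weak level, as the summary states.
        self.blocks[tag] = Protection.WEAK
        return False
```

For example, with CacheSet(ways=2), accessing tags A, B, A, A leaves A at STRONG and B at WEAK; a subsequent access to C evicts B, the least recently used block, and inserts C at WEAK. A block that is never reused therefore never pays for stronger check codes in this model.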
ISSN: 0167-739X, 1872-7115
DOI: 10.1016/j.future.2024.01.022