Analyzing and Mitigating the Cost of Persistence in High-Performance Computing Systems

Bibliographic Details
Main Author:	Elnawawy, Hussein Mohamed
Format:	Dissertation
Language:	English
Published:	ProQuest Dissertations & Theses 01-01-2020
Subjects:	Computer Engineering

Description
Summary:	Non-Volatile Main Memory (NVMM) technologies provide many opportunities for applications in modern and future systems. One of these opportunities is the ability to achieve data persistence in main memory rather than having to store data in much slower storage. This makes it possible to use main memory for checkpointing mechanisms that provide failure safety to applications. However, the non-volatility of NVMMs raises a potential data inconsistency problem, so applications need to ensure data consistency when making updates to critical data structures.

In light of these opportunities and challenges, we believe legacy checkpointing and logging schemes must be revisited, as they incur high execution time overheads as well as a large number of additional writes to NVMM, which may significantly impact its write endurance.

We analyze different checkpointing techniques when used with NVMMs. Then, we propose a novel recompute-based failure safety approach and demonstrate its applicability to loop-based code. Rather than keeping a fully consistent logging state, we only log enough state to enable recomputation. Upon a failure, our approach recovers to a consistent state by determining which parts of the computation were not completed and recomputing them. Effectively, our approach removes the need to keep checkpoints or logs, thus reducing execution time overheads and improving NVMM write endurance, at the expense of more complex recovery.

We then go deeper into the different implementations of General Matrix Multiply (GEMM) algorithms to analyze their performance and their impact on NVMM when our Recompute technique is used for checkpointing. Our aim is to analyze the persistence/performance trade-offs that the programmer should be aware of. To approach this problem, we perform experiments on the different GEMM implementations in GotoBLAS, one of the most widely used basic linear algebra libraries.

After that, we address the overhead of persistence in NVMMs in general. To avoid potential data inconsistency, applications need to make sure that updates to persistent data comply with the crash consistency model. To achieve that, transaction-based approaches, such as write-ahead logging, have been proposed to ensure data consistency in NVMM. These approaches use cache line flushing followed by a memory or store fence to ensure data durability. While cache line flush and write-back instructions can complete in the background, fence instructions expose the latency of flushing on the critical path of the program's execution, incurring significant overheads. We observe that if flush operations are started earlier, the penalty of fences can be significantly reduced.

We propose PreFlush, a lightweight and transparent hardware mechanism that predicts when a cache line flush or write-back is needed and speculatively performs the operation early. Since the flush is performed speculatively, we add hardware to determine when it is more profitable to flush, and we handle cases where PreFlush misspeculates while still ensuring correct execution, without the need for any complex recovery mechanisms. PreFlush also requires no modification to existing NVMM-enabled code.
ISBN:	9798641392691
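
The recompute-based idea described in the summary (log only enough state to know which parts of a loop finished, then recompute the rest after a failure) can be illustrated with a toy sketch. This is not the dissertation's implementation: the `nvmm` dict standing in for persistent memory, the per-block completion markers, the `SimulatedCrash` exception, and all function names are assumptions made for illustration, and real NVMM code would also need cache line flushes and fences, which are omitted here.

```python
class SimulatedCrash(Exception):
    """Stands in for a power failure in this toy model (hypothetical)."""


def compute_block(a, b, rows):
    """Return the rows of A @ B listed in `rows`, using plain nested lists."""
    cols = range(len(b[0]))
    inner = range(len(b))
    return {i: [sum(a[i][k] * b[k][j] for k in inner) for j in cols] for i in rows}


def run(a, b, nvmm, block_size=2, crash_at_block=None):
    """Compute C = A @ B one row block at a time.

    `nvmm` is a dict standing in for persistent memory (an assumption):
    nvmm["c"] holds the output rows and nvmm["done"] the indices of blocks
    that were completely written. No copy of C is ever checkpointed.
    Blocks already marked done are skipped, so calling this function again
    after a crash acts as the recovery path. Returns the number of blocks
    computed by this call.
    """
    n = len(a)
    c = nvmm.setdefault("c", [[0] * len(b[0]) for _ in range(n)])
    done = nvmm.setdefault("done", set())
    computed = 0
    for blk, start in enumerate(range(0, n, block_size)):
        if blk in done:
            continue  # finished before the failure, nothing to redo
        rows = range(start, min(start + block_size, n))
        result = compute_block(a, b, rows)
        for pos, i in enumerate(rows):
            c[i] = result[i]
            if blk == crash_at_block and pos == 0:
                # Fail after a partial write: the block is inconsistent and unmarked.
                raise SimulatedCrash(f"failure inside block {blk}")
        done.add(blk)  # marker is written only after the whole block is in place
        computed += 1
    return computed


# Demo: crash while writing block 1, then recover by recomputing unmarked blocks.
a = [[1, 2], [3, 4], [5, 6], [7, 8]]
b = [[1, 0], [0, 1]]  # identity, so the expected result equals `a`
nvmm = {}
try:
    run(a, b, nvmm, crash_at_block=1)
except SimulatedCrash:
    pass
blocks_redone = run(a, b, nvmm)
print(blocks_redone, nvmm["c"] == a)
```

Recovery here is safe only because the inputs `a` and `b` are never modified, so recomputing an unmarked block is idempotent even if part of it was already written. The trade-off matches the one named in the summary: fewer writes during normal execution in exchange for extra work at recovery time.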