Improving per-thread performance on CMPs through timing speculation

Bibliographic Details
Main Author: Greskamp, Brian L
Format: Dissertation
Language: English
Description
Summary: The future of performance scaling lies in massively parallel workloads, but less-parallel applications remain important. Unfortunately, future process technologies and core microarchitectures no longer promise major per-thread performance improvements, so microarchitects must find new ways to improve per-thread performance. Moreover, they must do so without sacrificing parallel throughput. To meet these apparently conflicting demands, this dissertation proposes a Timing Speculation (TS) system for CMPs that boosts core clock frequencies past their normal limits when an application demands per-thread performance and operates efficiently at nominal frequency when it demands throughput. This dissertation begins by introducing Paceline, the first TS microarchitecture designed specifically for CMPs. Paceline enables two cores to work together to execute a single thread at high speed under TS or independently to execute two threads at the rated frequency. In single-thread mode, one core in the pair (the "Leader") executes at higher-than-normal frequency, while a "Checker" runs at the rated, safe frequency and verifies the execution. Next, this work enhances Paceline with BlueShift, a circuit design method for TS architectures that improves a circuit's common-case delay rather than focusing on worst-case delay like traditional design flows. BlueShift profiles a gate-level design as it runs real benchmark applications to identify the frequently exercised circuit paths and then applies speed optimizations to those paths only. These optimizations can be implemented in a way that can be enabled and disabled at run-time so that they do not exact a power cost when they are not needed (i.e., when the processor is executing a throughput workload). Finally, this work presents LeadOut, a CMP design that combines Paceline with an additional per-thread performance enhancement: the ability to increase core supply voltage above nominal.
LeadOut evaluates the performance gains that are possible with Paceline alone, voltage boosting alone, and both together. It shows major gains from applying the two techniques together when feasible and also shows that, in many cases, future CMPs have power and temperature headroom to exploit still more per-thread enhancements as long as they can be enabled and disabled dynamically according to application demand.
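The BlueShift step described in the summary (profile a design, find the frequently exercised paths, optimize only those) can be illustrated with a minimal sketch. This is not the dissertation's implementation: the function name, the per-cycle trace format, the path names, the delay table, and the two thresholds are all hypothetical, chosen only to show the selection idea.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set


def select_paths_to_speed_up(
    cycle_traces: Iterable[Set[str]],
    path_delay_ns: Dict[str, float],
    target_period_ns: float,
    min_exercise_rate: float,
) -> List[str]:
    """Pick paths that are exercised often AND too slow for the target period.

    Assumption (not from the source): each trace entry is the set of path
    names exercised in one profiled cycle.
    """
    counts: Counter = Counter()
    total_cycles = 0
    for exercised in cycle_traces:
        total_cycles += 1
        counts.update(exercised)  # each path counted once per cycle

    if total_cycles == 0:
        return []

    selected = [
        path
        for path, n in counts.items()
        if n / total_cycles >= min_exercise_rate
        and path_delay_ns[path] > target_period_ns
    ]
    # Most frequently exercised first; ties broken by name for determinism.
    return sorted(selected, key=lambda p: (-counts[p], p))


# Hypothetical profile: four cycles, three named paths.
traces = [{"alu_add", "bypass"}, {"alu_add"}, {"alu_add", "div"}, {"bypass"}]
delays = {"alu_add": 1.2, "bypass": 0.8, "div": 1.5}
chosen = select_paths_to_speed_up(
    traces, delays, target_period_ns=1.0, min_exercise_rate=0.5
)
print(chosen)
```

In this toy profile only `alu_add` is both common and slower than the target period. `div` is slow but rarely exercised, so it is left alone, which mirrors the common-case rather than worst-case focus the summary describes.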
Bibliography: Source: Dissertation Abstracts International, Volume: 71-01, Section: B, page: 0424.
Adviser: Josep Torrellas.
ISBN: 9781109575996, 1109575998