Maestro: orchestrating lifetime reliability in chip multiprocessors

Authors:
Shuguang Feng; Shantanu Gupta; Amin Ansari; Scott Mahlke
Affiliations:
Advanced Computer Architecture Laboratory, University of Michigan, Ann Arbor, MI (all authors)
Venue:
HiPEAC'10 Proceedings of the 5th international conference on High Performance Embedded Architectures and Compilers
Year:
2010

Abstract

As CMOS feature sizes venture deep into the nanometer regime, wearout mechanisms, including negative-bias temperature instability and time-dependent dielectric breakdown, can severely reduce processor operating lifetimes and performance. This paper presents an introspective reliability management system, Maestro, to tackle reliability challenges in future chip multiprocessors (CMPs) head-on. Unlike traditional approaches, Maestro relies on low-level sensors to monitor the CMP as it ages (introspection). Leveraging this real-time assessment of CMP health, runtime heuristics identify wearout-centric job assignments (management). By exploiting the complementary effects of the natural heterogeneity (due to process variation and wearout) that exists in CMPs and the diversity found in system workloads, Maestro composes job schedules that intelligently control the aging process. Monte Carlo experiments show that Maestro significantly enhances lifetime reliability through intelligent wear-leveling, increasing the expected service life of a population of 16-core CMPs by as much as 38% compared to a naive, round-robin scheduler. Furthermore, in the presence of process variation, Maestro's wearout-centric scheduling outperforms both performance-counter-based and temperature-sensor-based schedulers, achieving an order of magnitude more improvement in lifetime throughput, defined as the amount of useful work done by a system prior to failure.
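
The idea of wearout-centric job assignment can be illustrated with a minimal Python sketch. This is not Maestro's actual heuristic, which the abstract does not specify; it is a simple greedy wear-leveling policy written under stated assumptions. The names `core_health` (a per-core health estimate in [0, 1], as a sensor might report), `job_stress` (a per-job estimate of how much wear a job induces), and both function names are hypothetical. A round-robin assignment is included only as the naive baseline the abstract mentions.

```python
from typing import Dict, List


def wear_leveling_assignment(core_health: Dict[int, float],
                             job_stress: Dict[str, float]) -> Dict[str, int]:
    """Greedy sketch: give the most stressful jobs to the healthiest cores.

    core_health: hypothetical sensor reading per core (higher = less worn).
    job_stress:  hypothetical wear estimate per job (higher = more stressful).
    """
    if len(job_stress) > len(core_health):
        raise ValueError("this sketch assumes at most one job per core")
    # Healthiest cores first, most stressful jobs first.
    cores = sorted(core_health, key=lambda c: core_health[c], reverse=True)
    jobs = sorted(job_stress, key=lambda j: job_stress[j], reverse=True)
    # zip stops at the shorter list, so the most worn cores stay idle.
    return {job: core for job, core in zip(jobs, cores)}


def round_robin_assignment(core_ids: List[int],
                           job_names: List[str]) -> Dict[str, int]:
    """Naive baseline: assign jobs to cores in order, ignoring wear."""
    return {job: core_ids[i % len(core_ids)]
            for i, job in enumerate(job_names)}


if __name__ == "__main__":
    health = {0: 0.9, 1: 0.4, 2: 0.7}   # assumed sensor readings
    stress = {"a": 0.2, "b": 0.8}       # assumed per-job wear estimates
    print(wear_leveling_assignment(health, stress))
    print(round_robin_assignment([0, 1, 2], ["a", "b"]))
```

In this sketch the most worn core (core 1) is left idle when there are fewer jobs than cores, whereas the round-robin baseline places job `b` on it regardless of its condition.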