System implications of memory reliability in exascale computing

Authors:
Sheng Li;Ke Chen;Ming-Yu Hsieh;Naveen Muralimanohar;Chad D. Kersey;Jay B. Brockman;Arun F. Rodrigues;Norman P. Jouppi
Affiliations:
Hewlett-Packard Labs;University of Notre Dame and Hewlett-Packard Labs;Sandia National Labs;Hewlett-Packard Labs;Georgia Institute of Technology;University of Notre Dame;Sandia National Labs;Hewlett-Packard Labs
Venue:
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Year:
2011

Citing 26
Cited 7

Error-control coding for computer systems

Error-control coding for computer systems
The SPLASH-2 programs: characterization and methodological considerations

ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
A first order approximation to the optimum checkpoint interval

Communications of the ACM
ELDORADO

Proceedings of the 2nd conference on Computing frontiers
Pin: building customized program analysis tools with dynamic instrumentation

Proceedings of the 2005 ACM SIGPLAN conference on Programming language design and implementation
On the Architectural Requirements for Efficient Execution of Graph Algorithms

ICPP '05 Proceedings of the 2005 International Conference on Parallel Processing
A large-scale study of failures in high-performance computing systems

DSN '06 Proceedings of the International Conference on Dependable Systems and Networks
Performance counters and development of SPEC CPU2006

ACM SIGARCH Computer Architecture News
The PARSEC benchmark suite: characterization and architectural implications

Proceedings of the 17th international conference on Parallel architectures and compilation techniques
System RAS implications of DRAM soft errors

IBM Journal of Research and Development
Memory Systems: Cache, DRAM, Disk

Memory Systems: Cache, DRAM, Disk
Future scaling of processor-memory interfaces

Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
Leveraging 3D PCRAM technologies to reduce checkpoint overhead for future exascale systems

Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
McPAT: an integrated power, area, and timing modeling framework for multicore and manycore architectures

Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
PCRAMsim: system-level performance, energy, and area modeling for phase-change ram

Proceedings of the 2009 International Conference on Computer-Aided Design
A higher order estimate of the optimum checkpoint interval for restart dumps

Future Generation Computer Systems
Reducing cache power with low-cost, multi-bit error-correcting codes

Proceedings of the 37th annual international symposium on Computer architecture
Silicon-photonic network architectures for scalable, power-efficient multi-chip systems

Proceedings of the 37th annual international symposium on Computer architecture
Rethinking DRAM design and organization for energy-constrained multi-cores

Proceedings of the 37th annual international symposium on Computer architecture
A realistic evaluation of memory hardware errors and software system susceptibility

USENIXATC'10 Proceedings of the 2010 USENIX conference on USENIX annual technical conference
DRAM errors in the wild: a large-scale field study

Communications of the ACM
The structural simulation toolkit

ACM SIGMETRICS Performance Evaluation Review - Special issue on the 1st international workshop on performance modeling, benchmarking and simulation of high performance computing systems (PMBS 10)
A framework for architecture-level power, area, and thermal simulation and its application to network-on-chip design exploration

ACM SIGMETRICS Performance Evaluation Review - Special issue on the 1st international workshop on performance modeling, benchmarking and simulation of high performance computing systems (PMBS 10)
Lightweight Chip Multi-Threading (LCMT): Maximizing Fine-Grained Parallelism On-Chip

IEEE Transactions on Parallel and Distributed Systems
FREE-p: Protecting non-volatile memory against both hard and soft errors

HPCA '11 Proceedings of the 2011 IEEE 17th International Symposium on High Performance Computer Architecture
CACTI-P: architecture-level modeling for SRAM-based structures with advanced leakage reduction techniques

Proceedings of the International Conference on Computer-Aided Design

Evaluating operating system vulnerability to memory errors

Proceedings of the 2nd International Workshop on Runtime and Operating Systems for Supercomputers
LOT-ECC: localized and tiered reliability mechanisms for commodity memory systems

Proceedings of the 39th Annual International Symposium on Computer Architecture
MAGE: adaptive granularity and ECC for resilient and power efficient memory systems

SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Detection and correction of silent data corruption for large-scale high-performance computing

SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Exploring DRAM organizations for energy-efficient and resilient exascale memories

SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Rethinking algorithm-based fault tolerance with a cooperative software-hardware approach

SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Implicit-storing and redundant-encoding-of-attribute information in error-correction-codes

Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture

Quantified Score

Hi-index	0.00

Visualization

Abstract

Resiliency will be one of the toughest challenges in future exascale systems. Memory errors contribute more than 40% of the total hardware-related failures and are projected to increase in future exascale systems. The use of error correction codes (ECC) and checkpointing are two effective approaches to fault tolerance. While there are numerous studies on ECC or checkpointing in isolation, this is the first paper to investigate the combined effect of both on overall system performance and power. Specifically, we study the impact of various ECC schemes (SECDED, BCH, and chipkill) in conjunction with checkpointing on future exascale systems. Our simulation results show that while chipkill is 13% better for computation-intensive applications, BCH has a 28% advantage in system energy-delay product (EDP) for memory-intensive applications. We also propose to use BCH in tagged memory systems with commodity DRAMs where chipkill is impractical. Our proposed architecture achieves 2.3x better system EDP than state-of-the-art tagged memory systems.