Error-control coding for computer systems
Error-control coding for computer systems
The SPLASH-2 programs: characterization and methodological considerations
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
A first order approximation to the optimum checkpoint interval
Communications of the ACM
Proceedings of the 2nd conference on Computing frontiers
Pin: building customized program analysis tools with dynamic instrumentation
Proceedings of the 2005 ACM SIGPLAN conference on Programming language design and implementation
On the Architectural Requirements for Efficient Execution of Graph Algorithms
ICPP '05 Proceedings of the 2005 International Conference on Parallel Processing
A large-scale study of failures in high-performance computing systems
DSN '06 Proceedings of the International Conference on Dependable Systems and Networks
Performance counters and development of SPEC CPU2006
ACM SIGARCH Computer Architecture News
The PARSEC benchmark suite: characterization and architectural implications
Proceedings of the 17th international conference on Parallel architectures and compilation techniques
System RAS implications of DRAM soft errors
IBM Journal of Research and Development
Memory Systems: Cache, DRAM, Disk
Memory Systems: Cache, DRAM, Disk
Future scaling of processor-memory interfaces
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
Leveraging 3D PCRAM technologies to reduce checkpoint overhead for future exascale systems
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
PCRAMsim: system-level performance, energy, and area modeling for phase-change ram
Proceedings of the 2009 International Conference on Computer-Aided Design
A higher order estimate of the optimum checkpoint interval for restart dumps
Future Generation Computer Systems
Reducing cache power with low-cost, multi-bit error-correcting codes
Proceedings of the 37th annual international symposium on Computer architecture
Silicon-photonic network architectures for scalable, power-efficient multi-chip systems
Proceedings of the 37th annual international symposium on Computer architecture
Rethinking DRAM design and organization for energy-constrained multi-cores
Proceedings of the 37th annual international symposium on Computer architecture
A realistic evaluation of memory hardware errors and software system susceptibility
USENIXATC'10 Proceedings of the 2010 USENIX conference on USENIX annual technical conference
DRAM errors in the wild: a large-scale field study
Communications of the ACM
The structural simulation toolkit
ACM SIGMETRICS Performance Evaluation Review - Special issue on the 1st international workshop on performance modeling, benchmarking and simulation of high performance computing systems (PMBS 10)
ACM SIGMETRICS Performance Evaluation Review - Special issue on the 1st international workshop on performance modeling, benchmarking and simulation of high performance computing systems (PMBS 10)
Lightweight Chip Multi-Threading (LCMT): Maximizing Fine-Grained Parallelism On-Chip
IEEE Transactions on Parallel and Distributed Systems
FREE-p: Protecting non-volatile memory against both hard and soft errors
HPCA '11 Proceedings of the 2011 IEEE 17th International Symposium on High Performance Computer Architecture
Proceedings of the International Conference on Computer-Aided Design
Evaluating operating system vulnerability to memory errors
Proceedings of the 2nd International Workshop on Runtime and Operating Systems for Supercomputers
LOT-ECC: localized and tiered reliability mechanisms for commodity memory systems
Proceedings of the 39th Annual International Symposium on Computer Architecture
MAGE: adaptive granularity and ECC for resilient and power efficient memory systems
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Detection and correction of silent data corruption for large-scale high-performance computing
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Exploring DRAM organizations for energy-efficient and resilient exascale memories
SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Rethinking algorithm-based fault tolerance with a cooperative software-hardware approach
SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Implicit-storing and redundant-encoding-of-attribute information in error-correction-codes
Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture
Hi-index | 0.00 |
Resiliency will be one of the toughest challenges in future exascale systems. Memory errors contribute more than 40% of the total hardware-related failures and are projected to increase in future exascale systems. The use of error correction codes (ECC) and checkpointing are two effective approaches to fault tolerance. While there are numerous studies on ECC or checkpointing in isolation, this is the first paper to investigate the combined effect of both on overall system performance and power. Specifically, we study the impact of various ECC schemes (SECDED, BCH, and chipkill) in conjunction with checkpointing on future exascale systems. Our simulation results show that while chipkill is 13% better for computation-intensive applications, BCH has a 28% advantage in system energy-delay product (EDP) for memory-intensive applications. We also propose to use BCH in tagged memory systems with commodity DRAMs where chipkill is impractical. Our proposed architecture achieves 2.3x better system EDP than state-of-the-art tagged memory systems.