Evaluating the feasibility of using memory content similarity to improve system resilience

Authors:
Scott Levy;Patrick G. Bridges;Kurt B. Ferreira;Aidan P. Thompson;Christian Trott
Affiliations:
University of New Mexico;University of New Mexico;Sandia National Laboratories;Sandia National Laboratories;Sandia National Laboratories
Venue:
Proceedings of the 3rd International Workshop on Runtime and Operating Systems for Supercomputers
Year:
2013



Abstract

Building the next generation of extreme-scale distributed systems will require overcoming several challenges related to system resilience. As the number of processors in these systems grows, the failure rate increases proportionally. Memory errors are among the most common sources of failure in large-scale systems. In this paper, we propose a novel runtime for transparently exploiting memory content similarity to improve system resilience by reducing the rate at which memory errors lead to node failure. We evaluate the feasibility of this approach by examining memory snapshots collected from eight HPC applications. Based on the characteristics of the similarity that we uncover in these applications, we conclude that our proposed approach shows promise for addressing system resilience in large-scale systems.
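
To make the idea of examining memory snapshots for content similarity concrete, here is a minimal sketch of one possible measurement: the fraction of pages in a snapshot whose exact content appears in at least one other page. It is an illustration under stated assumptions, not the method used in the paper. The 4 KiB page size, the use of SHA-256 digests to detect exact duplicates, and the function name `duplicate_page_fraction` are all hypothetical choices, and the sketch ignores pages that are similar without being identical.

```python
import hashlib
from collections import Counter

PAGE_SIZE = 4096  # assumed page granularity; the abstract does not specify one


def duplicate_page_fraction(snapshot: bytes, page_size: int = PAGE_SIZE) -> float:
    """Return the fraction of pages whose content also appears in another page.

    Hypothetical helper for illustration only: it detects exact duplicates
    by comparing SHA-256 digests of fixed-size pages.
    """
    # Split the snapshot into fixed-size pages, ignoring a trailing partial page.
    n_pages = len(snapshot) // page_size
    if n_pages == 0:
        return 0.0

    # Count how many pages share each digest.
    digests = Counter(
        hashlib.sha256(snapshot[i * page_size:(i + 1) * page_size]).digest()
        for i in range(n_pages)
    )

    # A page counts as duplicated if at least one other page has the same digest.
    duplicated = sum(count for count in digests.values() if count > 1)
    return duplicated / n_pages


if __name__ == "__main__":
    zero_page = bytes(PAGE_SIZE)
    page_a = b"\x01" * PAGE_SIZE
    page_b = b"\x02" * PAGE_SIZE
    # Two identical zero pages plus two distinct pages: half the pages have a copy.
    print(duplicate_page_fraction(zero_page + page_a + zero_page + page_b))  # 0.5
```

A page that has an identical copy elsewhere is, in principle, one whose contents could be restored from that copy after a memory error, which is why a duplicate fraction like this is a plausible first feasibility indicator.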