Proceedings of the 6th international workshop on Virtualization Technologies in Distributed Computing Date
Building the next generation of extreme-scale distributed systems will require overcoming several challenges related to system resilience. As the number of processors in these systems grows, the failure rate increases proportionally. Memory errors are among the most common sources of failure in large-scale systems. In this paper, we propose a novel runtime system that transparently exploits memory content similarity to improve system resilience by reducing the rate at which memory errors lead to node failure. We evaluate the feasibility of this approach by examining memory snapshots collected from eight HPC applications. Based on the characteristics of the similarity that we uncover in these applications, we conclude that our proposed approach shows promise for addressing system resilience in large-scale systems.
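
To make the idea of memory content similarity concrete, the following minimal sketch measures how many pages of a memory snapshot have at least one exact duplicate. It is an illustration only, not the measurement method used in the paper: the 4 KiB page size, the use of SHA-256 to compare pages, the restriction to exact (not partial) matches, and the function names `duplicate_page_groups` and `duplicated_fraction` are all assumptions.

```python
import hashlib
from collections import defaultdict

PAGE_SIZE = 4096  # assumed page size in bytes (hypothetical choice)


def duplicate_page_groups(snapshot: bytes, page_size: int = PAGE_SIZE):
    """Return lists of page indices whose contents are byte-identical.

    Only groups with at least two pages are returned. A trailing
    partial page, if any, is ignored.
    """
    groups = defaultdict(list)
    for offset in range(0, len(snapshot) - page_size + 1, page_size):
        page = snapshot[offset:offset + page_size]
        # Hash the page so that pages can be grouped by content.
        digest = hashlib.sha256(page).digest()
        groups[digest].append(offset // page_size)
    return sorted(idxs for idxs in groups.values() if len(idxs) > 1)


def duplicated_fraction(snapshot: bytes, page_size: int = PAGE_SIZE) -> float:
    """Fraction of full pages that have at least one identical copy."""
    total_pages = len(snapshot) // page_size
    if total_pages == 0:
        return 0.0
    duplicated = sum(len(g) for g in duplicate_page_groups(snapshot, page_size))
    return duplicated / total_pages


if __name__ == "__main__":
    # Synthetic five-page "snapshot": pages 0 and 1 match, pages 2 and 4 match.
    snap = (b"\x00" * PAGE_SIZE * 2 + b"\x01" * PAGE_SIZE
            + b"\x02" * PAGE_SIZE + b"\x01" * PAGE_SIZE)
    print(duplicate_page_groups(snap))
    print(duplicated_fraction(snap))
```

Under these assumptions, a page that belongs to one of the returned groups has an identical copy elsewhere in memory, so in principle its contents could be restored from that copy after an uncorrectable error rather than the node failing. Whether the proposed runtime works at page granularity or also uses partially similar pages is not stated in the abstract.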