Containment domains: A scalable, efficient and flexible resilience scheme for exascale systems

Authors:
Jinsuk Chung;Ikhwan Lee;Michael Sullivan;Jee Ho Ryoo;Dong Wan Kim;Doe Hyun Yoon;Larry Kaplan;Mattan Erez
Affiliations:
The University of Texas at Austin, Austin, TX, USA;The University of Texas at Austin, Austin, TX, USA;The University of Texas at Austin, Austin, TX, USA;The University of Texas at Austin, Austin, TX, USA;The University of Texas at Austin, Austin, TX, USA;IBM, Yorktown Heights, NY, USA;Cray Inc., Seattle, WA, USA;The University of Texas at Austin, Austin, TX, USA
Venue:
Scientific Programming - Selected Papers from Super Computing 2012
Year:
2013

Abstract

This paper describes and evaluates a scalable and efficient resilience scheme based on the concept of containment domains. A containment domain is a programming construct that enables applications to express resilience needs and to interact with the system to tune and specialize error detection, state preservation and restoration, and recovery schemes. Containment domains have weak transactional semantics and are nested to take advantage of the machine and application hierarchies and to enable hierarchical state preservation, restoration, and recovery. We evaluate the scalability and efficiency of containment domains using generalized trace-driven simulation and analytical modeling, and show that containment domains are superior to both checkpoint-restart and redundant-execution approaches.
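
The preserve, detect, and recover cycle and the nesting described in the abstract can be illustrated with a small sketch. This is not the authors' API: `run_domain`, `UnrecoveredError`, `is_corrupt`, and the fault-injection helpers are hypothetical names chosen for illustration, state is a plain dictionary, and "preservation" is simply a deep copy held in memory. The sketch only shows the control flow: a domain preserves state on entry, re-executes its body when an error is detected, and escalates to its parent domain when local recovery fails.

```python
import copy


class UnrecoveredError(Exception):
    """Raised when a domain exhausts its retries and escalates to its parent."""


def run_domain(state, body, detect, max_retries=2):
    """Run body(state) as one containment domain (hypothetical helper).

    preserve: deep-copy the state dictionary on entry
    detect:   detect(state) returns True if the result looks erroneous
    recover:  restore the preserved copy and re-execute the body;
              escalate to the parent if the retries run out
    Returns the number of re-executions that were needed.
    """
    preserved = copy.deepcopy(state)  # preserve
    for attempt in range(max_retries + 1):
        try:
            body(state)  # the body may itself open nested domains
            if not detect(state):
                return attempt
        except UnrecoveredError:
            pass  # a child escalated, so recover at this level
        # restore the preserved state in place before the next attempt
        state.clear()
        state.update(copy.deepcopy(preserved))
    raise UnrecoveredError("escalating to parent domain")


def is_corrupt(state):
    """Toy detector (assumption): a negative x signals an error."""
    return state["x"] < 0


def make_faulty_increment(fault_plan):
    """Return a body that increments x and corrupts it when the plan says so."""
    plan = iter(fault_plan)

    def body(state):
        state["x"] += 1
        if next(plan, False):
            state["x"] = -999  # injected fault

    return body


def make_outer(inner_body):
    """Return an outer body that doubles x, then runs a nested inner domain."""

    def body(state):
        state["x"] *= 2
        state["inner_retries"] = run_domain(state, inner_body, is_corrupt)

    return body


if __name__ == "__main__":
    # One injected fault: the inner domain recovers locally.
    state = {"x": 5}
    outer = make_outer(make_faulty_increment([True, False]))
    print(run_domain(state, outer, is_corrupt), state)
```

With the fault plan `[True, False]`, the inner domain re-executes once and the outer domain never rolls back. With three consecutive injected faults, the inner domain exhausts its retries and escalates, and the outer domain restores its own preserved state and re-executes. A real implementation would choose where to preserve state (local memory, a neighbor node, or a file system) per level of the hierarchy, which this sketch does not model.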