Fault Tolerance in Petascale/ Exascale Systems: Current Knowledge, Challenges and Research Opportunities

Authors:
Franck Cappello
Affiliations:
INRIA AND UIUC, THOMAS M. SIEBEL CENTER FOR COMPUTERSCIENCE, 201 N GOODWIN AVE, URBANA, IL 61801-2302, USA
Venue:
International Journal of High Performance Computing Applications
Year:
2009

Citing 16
Cited 17

Distributed snapshots: determining global states of distributed systems

ACM Transactions on Computer Systems (TOCS)
Diskless Checkpointing

IEEE Transactions on Parallel and Distributed Systems
Memory exclusion: optimizing the performance of checkpointing systems

Software—Practice & Experience
Self-stabilizing systems in spite of distributed control

Communications of the ACM
A survey of rollback-recovery protocols in message-passing systems

ACM Computing Surveys (CSUR)
CoCheck: Checkpointing and Process Migration for MPI

IPPS '96 Proceedings of the 10th International Parallel Processing Symposium
MPICH-V: toward a scalable fault tolerant MPI for volatile nodes

Proceedings of the 2002 ACM/IEEE conference on Supercomputing
BlueGene/L Failure Analysis and Prediction Models

DSN '06 Proceedings of the International Conference on Dependable Systems and Networks
Scalable diskless checkpointing for large parallel systems

Scalable diskless checkpointing for large parallel systems
What Supercomputers Say: A Study of Five System Logs

DSN '07 Proceedings of the 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks
Fault-Driven Re-Scheduling For Improving System-level Fault Resilience

ICPP '07 Proceedings of the 2007 International Conference on Parallel Processing
Algorithm-Based Fault Tolerance for Matrix Operations

IEEE Transactions on Computers
Hierarchical Replication Techniques to Ensure Checkpoint Storage Reliability in Grid Environment

CCGRID '08 Proceedings of the 2008 Eighth IEEE International Symposium on Cluster Computing and the Grid
Proactive process-level live migration in HPC environments

Proceedings of the 2008 ACM/IEEE conference on Supercomputing
A tunable holistic resiliency approach for high-performance computing systems

Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming
Proactive fault tolerance in MPI applications via task migration

HiPC'06 Proceedings of the 13th international conference on High Performance Computing

Distributed Diskless Checkpoint for Large Scale Systems

CCGRID '10 Proceedings of the 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing
Virtualizing high performance computing

ACM SIGOPS Operating Systems Review
Hybrid checkpointing using emerging nonvolatile memories for future exascale systems

ACM Transactions on Architecture and Code Optimization (TACO)
High performance linpack benchmark: a fault tolerant implementation without checkpointing

Proceedings of the international conference on Supercomputing
Algorithm-based recovery for iterative methods without checkpointing

Proceedings of the 20th international symposium on High performance distributed computing
On the use of cluster-based partial message logging to improve fault tolerance for MPI HPC applications

Euro-Par'11 Proceedings of the 17th international conference on Parallel processing - Volume Part I
FTI: high performance fault tolerance interface for hybrid systems

Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Evaluating the viability of process replication reliability for exascale systems

Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Scientific data services: a high-performance I/O system with array semantics

Proceedings of the first annual workshop on High performance computing meets databases
Algorithm-based fault tolerance for dense matrix factorizations

Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming
Tuple switching network-When slower may be better

Journal of Parallel and Distributed Computing
Developing a power measurement framework for cyber defense

Proceedings of the Eighth Annual Cyber Security and Information Intelligence Research Workshop
When self-stabilization meets real platforms: An experimental study of a peer-to-peer service discovery system

Future Generation Computer Systems
ACR: automatic checkpoint/restart for soft and hard error protection

SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance computing systems

The Journal of Supercomputing
Post-failure recovery of MPI communication capability: Design and rationale

International Journal of High Performance Computing Applications
Evaluating energy savings for checkpoint/restart

E2SC '13 Proceedings of the 1st International Workshop on Energy Efficient Supercomputing

Quantified Score

Hi-index	0.00

Visualization

Abstract

The emergence of petascale systems and the promise of future exascale systems have reinvigorated the community interest in how to manage failures in such systems and ensure that large applications, lasting several hours or tens of hours, are completed successfully. Most of the existing results for several key mechanisms associated with fault tolerance in high-performance computing (HPC) platforms follow the rollbackâ聙聰recovery approach. Over the last decade, these mechanisms have received a lot of attention from the community with different levels of success. Unfortunately, despite their high degree of optimization, existing approaches do not fit well with the challenging evolutions of large-scale systems. There is room and even a need for new approaches. Opportunities may come from different origins: diskless checkpointing, algorithmic-based fault tolerance, proactive operation, speculative execution, software transactional memory, forward recovery, etc. The contributions of this paper are as follows: (1) we summarize and analyze the existing results concerning the failures in large-scale computers and point out the urgent need for drastic improvements or disruptive approaches for fault tolerance in these systems; (2) we sketch most of the known opportunities and analyze their associated limitations; (3) we extract and express the challenges that the HPC community will have to face for addressing the stringent issue of failures in HPC systems.