Modeling and tolerating heterogeneous failures in large parallel systems

Authors:
Eric Heien;Derrick Kondo;Ana Gainaru;Dan LaPine;Bill Kramer;Franck Cappello
Affiliations:
INRIA, France;INRIA, France;University Politehnica of Bucharest;UIUC, NCSA, Urbana, IL;NCSA, UIUC, Urbana, IL;INRIA, France, UIUC, Urbana, IL
Venue:
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Year:
2011

Citing 16
Cited 6

A first order approximation to the optimum checkpoint interval

Communications of the ACM
Is remote host availability governed by a universal law?

ACM SIGMETRICS Performance Evaluation Review
Towards informatic analysis of syslogs

CLUSTER '04 Proceedings of the 2004 IEEE International Conference on Cluster Computing
A large-scale study of failures in high-performance computing systems

DSN '06 Proceedings of the International Conference on Dependable Systems and Networks
Blocking vs. non-blocking coordinated checkpointing for large-scale fault tolerant MPI

Proceedings of the 2006 ACM/IEEE conference on Supercomputing
Understanding disk failure rates: What does an MTTF of 1,000,000 hours mean to you?

ACM Transactions on Storage (TOS)
Alert Detection in System Logs

ICDM '08 Proceedings of the 2008 Eighth IEEE International Conference on Data Mining
How are Real Grids Used? The Analysis of Four Grid Traces and Its Implications

GRID '06 Proceedings of the 7th IEEE/ACM International Conference on Grid Computing
DRAM errors in the wild: a large-scale field study

Proceedings of the eleventh international joint conference on Measurement and modeling of computer systems
DMTCP: Transparent checkpointing for cluster computations and the desktop

IPDPS '09 Proceedings of the 2009 IEEE International Symposium on Parallel&Distributed Processing
Toward Exascale Resilience

International Journal of High Performance Computing Applications
The Failure Trace Archive: Enabling Comparative Analysis of Failures in Diverse Distributed Systems

CCGRID '10 Proceedings of the 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing
A realistic evaluation of memory hardware errors and software system susceptibility

USENIXATC'10 Proceedings of the 2010 USENIX conference on USENIX annual technical conference
Availability in globally distributed storage systems

OSDI'10 Proceedings of the 9th USENIX conference on Operating systems design and implementation
On the Scheduling of Checkpoints in Desktop Grids

CCGRID '11 Proceedings of the 2011 11th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing
Characterizing result errors in internet desktop grids

Euro-Par'07 Proceedings of the 13th international Euro-Par conference on Parallel Processing

Adaptive event prediction strategy with dynamic time window for large-scale HPC systems

SLAML '11 Managing Large-scale Systems via the Analysis of System Logs and the Application of Machine Learning Techniques
Enabling Application Resilience with and without the MPI Standard

CCGRID '12 Proceedings of the 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (ccgrid 2012)
Fault prediction under the microscope: a closer look into HPC systems

SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
On distributed file tree walk of parallel file systems

SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Failure prediction for HPC systems and applications: Current situation and open issues

International Journal of High Performance Computing Applications
Checkpointing algorithms and fault prediction

Journal of Parallel and Distributed Computing

Quantified Score

Hi-index	0.00

Visualization

Abstract

As supercomputers and clusters increase in size and complexity, system failures are inevitable. Different hardware components (such as memory, disk, or network) of such systems can have different failure rates. Prior works assume failures equally affect an application, whereas our goal is to provide failure models for applications that reflect their specific component usage. This is challenging because component failure dynamics are heterogeneous in space and time. To this end, we study 5 years of system logs from a production high-performance computing system and model hardware failures involving processors, memory, storage and network components. We model each component and construct integrated failure models given the component usage of common supercomputing applications. We show that these application-centric models provide more accurate reliability estimates compared to general models, which improves the efficacy of fault-tolerant algorithms. In particular, we demonstrate how applications can tune their checkpointing strategies to the tailored model.