A first order approximation to the optimum checkpoint interval
Communications of the ACM
Is remote host availability governed by a universal law?
ACM SIGMETRICS Performance Evaluation Review
Towards informatic analysis of syslogs
CLUSTER '04 Proceedings of the 2004 IEEE International Conference on Cluster Computing
A large-scale study of failures in high-performance computing systems
DSN '06 Proceedings of the International Conference on Dependable Systems and Networks
Blocking vs. non-blocking coordinated checkpointing for large-scale fault tolerant MPI
Proceedings of the 2006 ACM/IEEE conference on Supercomputing
Understanding disk failure rates: What does an MTTF of 1,000,000 hours mean to you?
ACM Transactions on Storage (TOS)
Alert Detection in System Logs
ICDM '08 Proceedings of the 2008 Eighth IEEE International Conference on Data Mining
How are Real Grids Used? The Analysis of Four Grid Traces and Its Implications
GRID '06 Proceedings of the 7th IEEE/ACM International Conference on Grid Computing
DRAM errors in the wild: a large-scale field study
Proceedings of the eleventh international joint conference on Measurement and modeling of computer systems
DMTCP: Transparent checkpointing for cluster computations and the desktop
IPDPS '09 Proceedings of the 2009 IEEE International Symposium on Parallel&Distributed Processing
International Journal of High Performance Computing Applications
The Failure Trace Archive: Enabling Comparative Analysis of Failures in Diverse Distributed Systems
CCGRID '10 Proceedings of the 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing
A realistic evaluation of memory hardware errors and software system susceptibility
USENIXATC'10 Proceedings of the 2010 USENIX conference on USENIX annual technical conference
Availability in globally distributed storage systems
OSDI'10 Proceedings of the 9th USENIX conference on Operating systems design and implementation
On the Scheduling of Checkpoints in Desktop Grids
CCGRID '11 Proceedings of the 2011 11th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing
Characterizing result errors in internet desktop grids
Euro-Par'07 Proceedings of the 13th international Euro-Par conference on Parallel Processing
Adaptive event prediction strategy with dynamic time window for large-scale HPC systems
SLAML '11 Managing Large-scale Systems via the Analysis of System Logs and the Application of Machine Learning Techniques
Enabling Application Resilience with and without the MPI Standard
CCGRID '12 Proceedings of the 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (ccgrid 2012)
Fault prediction under the microscope: a closer look into HPC systems
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
On distributed file tree walk of parallel file systems
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Failure prediction for HPC systems and applications: Current situation and open issues
International Journal of High Performance Computing Applications
Checkpointing algorithms and fault prediction
Journal of Parallel and Distributed Computing
Hi-index | 0.00 |
As supercomputers and clusters increase in size and complexity, system failures are inevitable. Different hardware components (such as memory, disk, or network) of such systems can have different failure rates. Prior works assume failures equally affect an application, whereas our goal is to provide failure models for applications that reflect their specific component usage. This is challenging because component failure dynamics are heterogeneous in space and time. To this end, we study 5 years of system logs from a production high-performance computing system and model hardware failures involving processors, memory, storage and network components. We model each component and construct integrated failure models given the component usage of common supercomputing applications. We show that these application-centric models provide more accurate reliability estimates compared to general models, which improves the efficacy of fault-tolerant algorithms. In particular, we demonstrate how applications can tune their checkpointing strategies to the tailored model.