A large-scale study of failures in high-performance computing systems

Authors:
Bianca Schroeder;Garth A. Gibson
Affiliations:
Carnegie Mellon University;Carnegie Mellon University
Venue:
DSN '06 Proceedings of the International Conference on Dependable Systems and Networks
Year:
2006

Citing 0
Cited 110

Disk failures in the real world: what does an MTTF of 1,000,000 hours mean to you?

FAST '07 Proceedings of the 5th USENIX conference on File and Storage Technologies
Using queue structures to improve job reliability

Proceedings of the 16th international symposium on High performance distributed computing
Proactive fault tolerance for HPC with Xen virtualization

Proceedings of the 21st annual international conference on Supercomputing
Understanding disk failure rates: What does an MTTF of 1,000,000 hours mean to you?

ACM Transactions on Storage (TOS)
Reliable multiprocessor system-on-chip synthesis

CODES+ISSS '07 Proceedings of the 5th IEEE/ACM international conference on Hardware/software codesign and system synthesis
Analysis and optimization of service availability in a HA cluster with load-dependent machine availability

IEEE Transactions on Parallel and Distributed Systems
Compiler-enhanced incremental checkpointing for OpenMP applications

Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming
DMTracker: finding bugs in large-scale parallel programs by detecting anomaly in data movements

Proceedings of the 2007 ACM/IEEE conference on Supercomputing
Exploring event correlation for failure prediction in coalitions of clusters

Proceedings of the 2007 ACM/IEEE conference on Supercomputing
Performance under failures of high-end computing

Proceedings of the 2007 ACM/IEEE conference on Supercomputing
Analyzing the impact of churn and malicious behavior on the quality of peer-to-peer web search

Proceedings of the 2008 ACM symposium on Applied computing
Software defect repair times: a multiplicative model

Proceedings of the 4th international workshop on Predictor models in software engineering
Performability modeling for scheduling and fault tolerance strategies for scientific workflows

HPDC '08 Proceedings of the 17th international symposium on High performance distributed computing
Compiler-Enhanced Incremental Checkpointing

Languages and Compilers for Parallel Computing
Flexible provisioning of web service workflows

ACM Transactions on Internet Technology (TOIT)
MPIWiz: subgroup reproducible replay of mpi applications

Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming
On the dynamic resource availability in grids

GRID '07 Proceedings of the 8th IEEE/ACM International Conference on Grid Computing
An analysis of clustered failures on large supercomputing systems

Journal of Parallel and Distributed Computing
Towards resilient high performance applications through real time reliability metric generation and autonomous failure correction

Proceedings of the 2009 workshop on Resiliency in high performance
Methodologies for advance warning of compute cluster problems via statistical analysis: a case study

Proceedings of the 2009 workshop on Resiliency in high performance
Characterizing fault tolerance in genetic programming

BADS '09 Proceedings of the 2009 workshop on Bio-inspired algorithms for distributed systems
DRAM errors in the wild: a large-scale field study

Proceedings of the eleventh international joint conference on Measurement and modeling of computer systems
SPLAY: distributed systems evaluation made simple (or how to turn ideas into live systems in a breeze)

NSDI'09 Proceedings of the 6th USENIX symposium on Networked systems design and implementation
Combined Fault Tolerance and Scheduling Techniques for Workflow Applications on Computational Grids

CCGRID '09 Proceedings of the 2009 9th IEEE/ACM International Symposium on Cluster Computing and the Grid
Performance under Failures of DAG-based Parallel Computing

CCGRID '09 Proceedings of the 2009 9th IEEE/ACM International Symposium on Cluster Computing and the Grid
Failure-Aware Construction and Reconfiguration of Distributed Virtual Machines for High Availability Computing

CCGRID '09 Proceedings of the 2009 9th IEEE/ACM International Symposium on Cluster Computing and the Grid
A performance study of grid workflow engines

GRID '08 Proceedings of the 2008 9th IEEE/ACM International Conference on Grid Computing
Optimal real number codes for fault tolerant matrix operations

Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
A reputation-driven scheduler for autonomic and sustainable resource sharing in Grid computing

Journal of Parallel and Distributed Computing
Failure-aware resource management for high-availability computing clusters with distributed virtual machines

Journal of Parallel and Distributed Computing
A multiplicative model of software defect repair times

Empirical Software Engineering
Characterizing fault tolerance in genetic programming

Future Generation Computer Systems
A study of dynamic meta-learning for failure prediction in large-scale systems

Journal of Parallel and Distributed Computing
Fault perturbations in building sensor network data streams

International Journal of Sensor Networks
A tradeoff analysis of delayed reconstruction for storage clusters

Proceedings of the 6th International Wireless Communications and Mobile Computing Conference
The Failure Trace Archive: Enabling Comparative Analysis of Failures in Diverse Distributed Systems

CCGRID '10 Proceedings of the 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing
Distributed Diskless Checkpoint for Large Scale Systems

CCGRID '10 Proceedings of the 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing
An Analysis of Traces from a Production MapReduce Cluster

CCGRID '10 Proceedings of the 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing
Quantifying event correlations for proactive failure management in networked computing systems

Journal of Parallel and Distributed Computing
Hunting for problems with Artemis

WASL'08 Proceedings of the First USENIX conference on Analysis of system logs
A flexible checkpoint/restart model in distributed systems

PPAM'09 Proceedings of the 8th international conference on Parallel processing and applied mathematics: Part I
Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System

Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
Modelling pilot-job applications on production grids

Euro-Par'09 Proceedings of the 2009 international conference on Parallel processing
Modeling resubmission in unreliable grids: the bottom-up approach

Euro-Par'09 Proceedings of the 2009 international conference on Parallel processing
A model for space-correlated failures in large-scale distributed systems

EuroPar'10 Proceedings of the 16th international Euro-Par conference on Parallel processing: Part I
Improving message logging protocols scalability through distributed event logging

EuroPar'10 Proceedings of the 16th international Euro-Par conference on Parallel processing: Part I
Failure-aware workflow scheduling in cluster environments

Cluster Computing
DRAM errors in the wild: a large-scale field study

Communications of the ACM
A rising tide lifts all boats: how memory error prediction and prevention can help with virtualized system longevity

HotDep'10 Proceedings of the Sixth international conference on Hot topics in system dependability
Risk aware overbooking for commercial grids

JSSPP'10 Proceedings of the 15th international conference on Job scheduling strategies for parallel processing
The importance of complete data sets for job scheduling simulations

JSSPP'10 Proceedings of the 15th international conference on Job scheduling strategies for parallel processing
Algorithm-based recovery for HPL

Proceedings of the 16th ACM symposium on Principles and practice of parallel programming
Cycles, cells and platters: an empirical analysis of hardware failures on a million consumer PCs

Proceedings of the sixth conference on Computer systems
RAFT at work: speeding-up mapreduce applications under task and node failures

Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
High performance linpack benchmark: a fault tolerant implementation without checkpointing

Proceedings of the international conference on Supercomputing
Algorithm-based recovery for iterative methods without checkpointing

Proceedings of the 20th international symposium on High performance distributed computing
Vrisha: using scaling properties of parallel programs for bug detection and localization

Proceedings of the 20th international symposium on High performance distributed computing
Baler: deterministic, lossless log message clustering tool

Computer Science - Research and Development
Towards IT systems capable of managing their health

FOCS'10 Proceedings of the 16th Monterey conference on Foundations of computer software: modeling, development, and verification of adaptive systems
Event log mining tool for large scale HPC systems

Euro-Par'11 Proceedings of the 17th international conference on Parallel processing - Volume Part I
A Robust and Efficient Message Passing Library for Volunteer Computing Environments

Journal of Grid Computing
A model of pilot-job resource provisioning on production grids

Parallel Computing
Adaptive event prediction strategy with dynamic time window for large-scale HPC systems

SLAML '11 Managing Large-scale Systems via the Analysis of System Logs and the Application of Machine Learning Techniques
FTI: high performance fault tolerance interface for hybrid systems

Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Checkpointing strategies for parallel jobs

Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Evaluating the viability of process replication reliability for exascale systems

Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Modeling and tolerating heterogeneous failures in large parallel systems

Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
System implications of memory reliability in exascale computing

Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Job failures in high performance computing systems: A large-scale empirical study

Computers & Mathematics with Applications
Failure prediction and localization in large scientific workflows

Proceedings of the 6th workshop on Workflows in support of large-scale science
Failure data-driven selective node-level duplication to improve MTTF in high performance computing systems

HPCS'09 Proceedings of the 23rd international conference on High Performance Computing Systems and Applications
Online workflow management and performance analysis with stampede

Proceedings of the 7th International Conference on Network and Services Management
Cosmic rays don't strike twice: understanding the nature of DRAM errors and the implications for system design

ASPLOS XVII Proceedings of the seventeenth international conference on Architectural Support for Programming Languages and Operating Systems
Characterizing fault-tolerance of genetic algorithms in desktop grid systems

EvoCOP'10 Proceedings of the 10th European conference on Evolutionary Computation in Combinatorial Optimization
Application monitoring and checkpointing in HPC: looking towards exascale systems

Proceedings of the 50th Annual Southeast Regional Conference
Evaluating application vulnerability to soft errors in multi-level cache hierarchy

Euro-Par'11 Proceedings of the 2011 international conference on Parallel Processing - Volume 2
On the viability of checkpoint compression for extreme scale fault tolerance

Euro-Par'11 Proceedings of the 2011 international conference on Parallel Processing - Volume 2
Temperature management in data centers: why some (might) like it hot
HOPE: A Hybrid Optimistic checkpointing and selective Pessimistic mEssage logging protocol for large scale distributed systems

Future Generation Computer Systems
Checkpointing Orchestration: Toward a Scalable HPC Fault-Tolerant Environment

CCGRID '12 Proceedings of the 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (ccgrid 2012)
Evaluating operating system vulnerability to memory errors

Proceedings of the 2nd International Workshop on Runtime and Operating Systems for Supercomputers
3-Dimensional root cause diagnosis via co-analysis

Proceedings of the 9th international conference on Autonomic computing
Characterizing output bottlenecks in a supercomputer

SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Alleviating scalability issues of checkpointing protocols

SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
A study of DRAM failures in the field

SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Fault prediction under the microscope: a closer look into HPC systems

SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Themis: an I/O-efficient MapReduce

Proceedings of the Third ACM Symposium on Cloud Computing
A decentralized approach for mining event correlations in distributed system monitoring

Journal of Parallel and Distributed Computing
Probabilistic versus possibilistic risk assessment models for optimal service level agreements in grid computing

Information Systems and e-Business Management
A reliability model for cloud computing for high performance computing applications

Euro-Par'12 Proceedings of the 18th international conference on Parallel processing workshops
The viability of using compression to decrease message log sizes

Euro-Par'12 Proceedings of the 18th international conference on Parallel processing workshops
A 1 PB/s file system to checkpoint three million MPI tasks

Proceedings of the 22nd international symposium on High-performance parallel and distributed computing
Using unreliable virtual hardware to inject errors in extreme-scale systems

Proceedings of the 3rd Workshop on Fault-tolerance for HPC at extreme scale
When is multi-version checkpointing needed?

Proceedings of the 3rd Workshop on Fault-tolerance for HPC at extreme scale
Evaluating the feasibility of using memory content similarity to improve system resilience

Proceedings of the 3rd International Workshop on Runtime and Operating Systems for Supercomputers
Performance comparison under failures of MPI and MapReduce: An analytical approach

Future Generation Computer Systems
Datacenter Scale Evaluation of the Impact of Temperature on Hard Disk Drive Failures

ACM Transactions on Storage (TOS)
The Failure Trace Archive: Enabling the comparison of failure measurements and models of distributed systems

Journal of Parallel and Distributed Computing
Failure analysis of distributed scientific workflows executing in the cloud

Proceedings of the 8th International Conference on Network and Service Management
DynamicCloudSim: simulating heterogeneity in computational clouds

Proceedings of the 2nd ACM SIGMOD Workshop on Scalable Workflow Execution Engines and Technologies
Feng shui of supercomputer memory: positional effects in DRAM and SRAM faults

SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
ACR: automatic checkpoint/restart for soft and hard error protection

SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
The analytic hierarchy process: task scheduling and resource allocation in cloud computing environment

The Journal of Supercomputing
Predictable quality of service atop degradable distributed systems

Cluster Computing
Evaluating energy savings for checkpoint/restart

E2SC '13 Proceedings of the 1st International Workshop on Energy Efficient Supercomputing
Reliability model of a system of k nodes with simultaneous failures for high-performance computing applications

International Journal of High Performance Computing Applications
Making problem diagnosis work for large-scale, production storage systems

LISA'13 Proceedings of the 27th international conference on Large Installation System Administration
Checkpointing algorithms and fault prediction

Journal of Parallel and Distributed Computing
Resource failures risk assessment modelling in distributed environments

Journal of Systems and Software
Report from the second workshop on scalable workflow enactment engines and technology (SWEET'13)

ACM SIGMOD Record

Abstract

Designing highly dependable systems requires a good understanding of failure characteristics. Unfortunately, little raw data on failures in large IT installations is publicly available. This paper analyzes failure data recently made publicly available by one of the largest high-performance computing sites. The data has been collected over the past 9 years at Los Alamos National Laboratory and includes 23000 failures recorded on more than 20 different systems, mostly large clusters of SMP and NUMA nodes. We study the statistics of the data, including the root cause of failures, the mean time between failures, and the mean time to repair. We find, for example, that average failure rates differ wildly across systems, ranging from 20 to 1000 failures per year, and that time between failures is modeled well by a Weibull distribution with decreasing hazard rate. From one system to another, mean repair time varies from less than an hour to more than a day, and repair times are well modeled by a lognormal distribution.
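
As a rough illustration of the two distributions named in the abstract, the sketch below evaluates a Weibull hazard rate and the mean and median of a lognormal. The function names and all parameter values are assumptions chosen for illustration; they are not taken from the paper or its data. A Weibull shape parameter below 1 gives a hazard rate that decreases with time since the last failure, and a lognormal has a mean larger than its median, so a few long repairs can dominate the average.

```python
import math


def weibull_hazard(t, shape, scale):
    """Weibull hazard rate h(t) = (shape/scale) * (t/scale)**(shape - 1), t > 0."""
    if t <= 0 or shape <= 0 or scale <= 0:
        raise ValueError("t, shape and scale must be positive")
    return (shape / scale) * (t / scale) ** (shape - 1)


def lognormal_mean_and_median(mu, sigma):
    """Mean and median of a lognormal whose logarithm is Normal(mu, sigma**2)."""
    return math.exp(mu + sigma ** 2 / 2), math.exp(mu)


# Hypothetical parameters, for illustration only (not fitted to any data).
SHAPE, SCALE = 0.7, 100.0

# With shape < 1 the hazard falls as the time since the last failure grows.
hazards = [weibull_hazard(t, SHAPE, SCALE) for t in (1.0, 10.0, 100.0, 1000.0)]
is_decreasing = all(a > b for a, b in zip(hazards, hazards[1:]))

# Lognormal repair times: the mean exceeds the median because of the long tail.
mean_repair, median_repair = lognormal_mean_and_median(mu=1.0, sigma=1.5)

print("hazard rates:", hazards)
print("decreasing hazard:", is_decreasing)
print("mean vs. median repair time:", mean_repair, median_repair)
```

Setting the shape parameter to exactly 1 reduces the Weibull to an exponential distribution with a constant hazard rate, which is a convenient baseline when checking whether a decreasing hazard rate is a meaningful claim for a given data set.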