Designing highly dependable systems requires a good understanding of failure characteristics. Unfortunately, little raw data on failures in large IT installations is publicly available. This paper analyzes failure data recently made publicly available by one of the largest high-performance computing sites. The data has been collected over the past 9 years at Los Alamos National Laboratory and includes 23,000 failures recorded on more than 20 different systems, mostly large clusters of SMP and NUMA nodes. We study the statistics of the data, including the root cause of failures, the mean time between failures, and the mean time to repair. We find, for example, that average failure rates differ wildly across systems, ranging from 20 to 1000 failures per year, and that time between failures is modeled well by a Weibull distribution with decreasing hazard rate. From one system to another, mean repair time varies from less than an hour to more than a day, and repair times are well modeled by a lognormal distribution.
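To make the two distributional findings concrete, the following minimal Python sketch evaluates a Weibull hazard rate and compares the mean and median of a lognormal distribution. All parameter values (`SHAPE`, `SCALE`, `MU`, `SIGMA`) are hypothetical and chosen only for illustration; they are not fitted values from the paper. The one property taken as given is standard: a Weibull distribution has a decreasing hazard rate exactly when its shape parameter is below 1.

```python
import math


def weibull_hazard(t, shape, scale):
    """Weibull hazard rate h(t) = (k / lam) * (t / lam) ** (k - 1)."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def weibull_mean(shape, scale):
    """Mean of a Weibull distribution: lam * Gamma(1 + 1/k)."""
    return scale * math.gamma(1 + 1 / shape)


def lognormal_median(mu, sigma):
    """Median of a lognormal distribution: exp(mu)."""
    return math.exp(mu)


def lognormal_mean(mu, sigma):
    """Mean of a lognormal distribution: exp(mu + sigma^2 / 2)."""
    return math.exp(mu + sigma ** 2 / 2)


# Hypothetical parameters, NOT taken from the paper.
SHAPE = 0.7    # shape < 1 gives a decreasing hazard rate
SCALE = 100.0  # scale parameter, in hours
MU = 1.0       # mean of log(repair time)
SIGMA = 1.5    # standard deviation of log(repair time)

# Hazard rate at increasing times since the last failure (hours).
times = (1.0, 10.0, 100.0, 1000.0)
hazards = [weibull_hazard(t, SHAPE, SCALE) for t in times]

for t, h in zip(times, hazards):
    print(f"t = {t:7.1f} h  hazard = {h:.5f} per hour")

print(f"Weibull mean time between failures: {weibull_mean(SHAPE, SCALE):.1f} h")
print(f"Lognormal repair time median: {lognormal_median(MU, SIGMA):.2f} h")
print(f"Lognormal repair time mean:   {lognormal_mean(MU, SIGMA):.2f} h")
```

With a shape parameter below 1, the printed hazard falls as the time since the last failure grows, so a failure is most likely shortly after the previous one. The lognormal mean exceeds the median because the distribution is right-skewed: a few long repairs pull the average up.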