The growing complexity and size of High Performance Computing (HPC) systems lead to frequent job failures, which may cause significant performance degradation. To provide high-performance and reliable computing services, an in-depth understanding of the characteristics of HPC job failures is essential. In this paper, we present an empirical study of job failures in 10 public workload data sets collected from 8 large-scale HPC systems around the world. Multiple analysis methods are applied to provide a comprehensive understanding of job failures. To facilitate the design, testing, and management of HPC systems, we study the properties of job failures from four aspects: proportion in workload and resource consumption, submission inter-arrival time, locality, and runtime. Our analysis shows that job failure rates are significant in most HPC systems and that, on average, a failed job often consumes more computational resources than a successful job. We also observe that the submission inter-arrival time of failed jobs is better fit by Generalized Pareto and Lognormal distributions, and that the probability of a failed job submission follows a "V" shape: it decreases during the first 100 seconds after the submission of the last failed job and increases afterward. The majority of job failures come from a small number of users and applications, and users, more than applications, are the primary factor associated with job failures. We find evidence that the lifetime accuracy of failed jobs (runtime / requested time) always follows the "bathtub curve". Moreover, job failures exhibit strong locality properties that can support the prediction of failed jobs' occurrence and runtime. Most of these findings are new contributions to the research community, and some reveal important properties of job failures that were previously misunderstood or poorly understood.
The wide range of analyses in this paper can directly support fault tolerance, scheduling, and workload modeling in HPC systems, and can lead to better system utility at lower cost.
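The quantities named in the abstract can be illustrated with a minimal sketch. It is not the authors' analysis code: the `Job` record, its field names, and the sample values are hypothetical, and CPU consumption is approximated here as runtime times processor count. It computes the failure rate, the failed jobs' share of CPU time, the submission inter-arrival times of failed jobs, the lifetime accuracy (runtime / requested time), and a maximum-likelihood Lognormal fit. Fitting a Generalized Pareto distribution would need a statistics library and is left out.

```python
from dataclasses import dataclass
from math import log
from statistics import mean, pstdev


@dataclass
class Job:
    """Hypothetical job record; field names are assumptions, not a trace format."""
    submit: float     # submission time in seconds
    runtime: float    # actual runtime in seconds
    requested: float  # user-requested time in seconds
    procs: int        # number of processors used
    failed: bool      # True if the job ended in failure


def failure_summary(jobs):
    """Return the fraction of failed jobs and their share of CPU time."""
    failed = [j for j in jobs if j.failed]
    total_cpu = sum(j.runtime * j.procs for j in jobs)
    failed_cpu = sum(j.runtime * j.procs for j in failed)
    return {
        "failure_rate": len(failed) / len(jobs),
        "failed_cpu_share": failed_cpu / total_cpu,
    }


def failed_interarrival(jobs):
    """Gaps between consecutive submissions of failed jobs, in seconds."""
    times = sorted(j.submit for j in jobs if j.failed)
    return [later - earlier for earlier, later in zip(times, times[1:])]


def lifetime_accuracy(job):
    """Runtime divided by requested time, as defined in the abstract."""
    return job.runtime / job.requested


def lognormal_mle(samples):
    """Maximum-likelihood (mu, sigma) of a Lognormal fit to positive samples."""
    logs = [log(x) for x in samples if x > 0]
    return mean(logs), pstdev(logs)


# Made-up sample data, for illustration only.
jobs = [
    Job(submit=0, runtime=100, requested=200, procs=4, failed=False),
    Job(submit=10, runtime=50, requested=100, procs=8, failed=True),
    Job(submit=40, runtime=300, requested=300, procs=2, failed=False),
    Job(submit=110, runtime=20, requested=200, procs=16, failed=True),
    Job(submit=400, runtime=5, requested=600, procs=32, failed=True),
]

print(failure_summary(jobs))
print(failed_interarrival(jobs))
print([lifetime_accuracy(j) for j in jobs if j.failed])
print(lognormal_mle(failed_interarrival(jobs)))
```

On a real trace, the inter-arrival list would be the input to distribution fitting and goodness-of-fit tests, and a histogram of lifetime accuracy over failed jobs is where a bathtub-shaped curve would show up.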