The Failure Trace Archive: Enabling the comparison of failure measurements and models of distributed systems

Authors:
Bahman Javadi;Derrick Kondo;Alexandru Iosup;Dick Epema
Venue:
Journal of Parallel and Distributed Computing
Year:
2013

Abstract

With the increasing presence, scale, and complexity of distributed systems, resource failures are becoming an important and practical topic of computer science research. While numerous failure models and failure-aware algorithms exist, their comparison has been hampered by the lack of public failure data sets and data processing tools. To facilitate the design, validation, and comparison of fault-tolerant models and algorithms, we have created the Failure Trace Archive (FTA), an online, public repository of failure traces collected from diverse parallel and distributed systems. In this work, we first describe the design of the archive, in particular the standard FTA data format, and the design of a toolbox that facilitates automated analysis of trace data sets. We also discuss the use of the FTA for current and future purposes. Second, after applying the toolbox to nine failure traces collected from distributed systems used in a range of application domains (e.g., HPC, Internet operation, and online applications), we present a comparative analysis of failures across these systems. Our analysis provides statistical insights and typical statistical modeling results for the availability of individual resources in these systems. The results underline the need for public availability of trace data from different distributed systems. Last, we show how different interpretations of the meaning of failure data can lead to different conclusions for failure modeling and job scheduling in distributed systems. Our results for these interpretations suggest that existing failure-aware algorithms may need to be revisited when applied to general rather than domain-specific distributed systems.