The Failure Trace Archive: Enabling Comparative Analysis of Failures in Diverse Distributed Systems

Authors:
Derrick Kondo; Bahman Javadi; Alexandru Iosup; Dick Epema
Venue:
CCGRID '10 Proceedings of the 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing
Year:
2010


Abstract

With the increasing functionality and complexity of distributed systems, resource failures are inevitable. While numerous models and algorithms for dealing with failures exist, the lack of public trace data sets and tools has prevented meaningful comparisons. To facilitate the design, validation, and comparison of fault-tolerant models and algorithms, we have created the Failure Trace Archive (FTA) as an online public repository of availability traces taken from diverse parallel and distributed systems. Our main contributions in this study are the following. First, we describe the design of the archive, in particular the rationale of the standard FTA format, and the design of a toolbox that facilitates automated analysis of trace data sets. Second, applying the toolbox, we present a uniform comparative analysis with statistics and models of failures in nine distributed systems. Third, we show how different interpretations of these data sets can result in different conclusions. This emphasizes the critical need for the public availability of trace data and methods for their analysis.
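The third contribution, that different interpretations of the same trace can lead to different conclusions, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the tuple layout `(node_id, start_seconds, end_seconds, available)` is hypothetical and is not the standard FTA format, the functions `merge_short_outages` and `summarize` are not part of the FTA toolbox, and the trace values are invented toy data. The sketch computes basic availability statistics twice, once on the raw events and once after treating outages shorter than a threshold as measurement noise.

```python
from statistics import mean

# Hypothetical toy trace: (node_id, start_seconds, end_seconds, available).
# This layout is an illustrative assumption, NOT the FTA format.
# Events are assumed sorted by node and start time, and contiguous per node.
TRACE = [
    ("n1", 0, 100, True),
    ("n1", 100, 105, False),  # a 5-second outage
    ("n1", 105, 300, True),
    ("n1", 300, 400, False),
    ("n1", 400, 500, True),
]


def merge_short_outages(events, min_outage):
    """Reinterpret outages shorter than min_outage as available time,
    then merge adjacent intervals of the same node and state."""
    merged = []
    for node, start, end, up in events:
        if not up and end - start < min_outage:
            up = True  # treat a brief outage as noise
        if (merged and merged[-1][0] == node
                and merged[-1][3] == up and merged[-1][2] == start):
            prev = merged.pop()
            merged.append((node, prev[1], end, up))
        else:
            merged.append((node, start, end, up))
    return merged


def summarize(events):
    """Return mean interval lengths and the fraction of time available."""
    up = [end - start for _, start, end, avail in events if avail]
    down = [end - start for _, start, end, avail in events if not avail]
    total = sum(up) + sum(down)
    return {
        "mean_up": mean(up),
        "mean_down": mean(down) if down else 0.0,
        "availability": sum(up) / total,
    }


raw = summarize(TRACE)
filtered = summarize(merge_short_outages(TRACE, min_outage=10))
print("raw:     ", raw)
print("filtered:", filtered)
```

On this toy trace the overall availability barely changes between the two interpretations, while the mean length of an availability interval rises from roughly 132 seconds to 200 seconds. A model fitted to interval lengths would therefore differ noticeably depending on which interpretation is chosen.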