With the increasing functionality and complexity of distributed systems, resource failures are inevitable. While numerous models and algorithms for dealing with failures exist, the lack of public trace data sets and tools has prevented meaningful comparisons. To facilitate the design, validation, and comparison of fault-tolerant models and algorithms, we have created the Failure Trace Archive (FTA) as an online public repository of availability traces taken from diverse parallel and distributed systems. Our main contributions in this study are the following. First, we describe the design of the archive, in particular the rationale of the standard FTA format, and the design of a toolbox that facilitates automated analysis of trace data sets. Second, applying the toolbox, we present a uniform comparative analysis with statistics and models of failures in nine distributed systems. Third, we show how different interpretations of these data sets can result in different conclusions. This emphasizes the critical need for the public availability of trace data and methods for their analysis.
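As a rough illustration of the third contribution, the sketch below shows how two readings of the same availability trace can lead to different failure statistics. It is not the FTA format or the FTA toolbox: the trace layout (a list of `(start, end)` availability intervals in seconds), the sample values, and the names `merge_short_gaps`, `summarize`, and `min_gap` are all assumptions made for this example.

```python
from statistics import mean

# Hypothetical trace for one node: (start, end) of each interval during which
# the node was observed as available, in seconds. Illustrative values only;
# this is NOT the standard FTA format.
TRACE = [(0, 100), (105, 300), (400, 450), (452, 900)]


def merge_short_gaps(intervals, min_gap):
    """Treat unavailability gaps shorter than min_gap as measurement noise
    and merge the availability intervals on either side of them."""
    merged = [intervals[0]]
    for start, end in intervals[1:]:
        prev_start, prev_end = merged[-1]
        if start - prev_end < min_gap:
            merged[-1] = (prev_start, end)  # gap too short: extend previous interval
        else:
            merged.append((start, end))     # gap counts as a real failure
    return merged


def summarize(intervals):
    """Compute simple per-node statistics from availability intervals."""
    durations = [end - start for start, end in intervals]
    span = intervals[-1][1] - intervals[0][0]
    return {
        "failures": len(intervals) - 1,  # one failure per gap between intervals
        "mean_availability_duration": mean(durations),
        "availability_fraction": sum(durations) / span,
    }


# Interpretation 1: every gap is a failure.
strict = summarize(TRACE)
# Interpretation 2 (assumed threshold): gaps under 10 s are not failures.
lenient = summarize(merge_short_gaps(TRACE, min_gap=10))

print("strict: ", strict)
print("lenient:", lenient)
```

On this toy trace the strict reading reports three failures and the lenient reading reports one, with a correspondingly longer mean availability interval, so a model fitted under one reading would differ from a model fitted under the other.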