With the increasing presence, scale, and complexity of distributed systems, resource failures are becoming an important and practical topic of computer science research. While numerous failure models and failure-aware algorithms exist, their comparison has been hampered by the lack of public failure data sets and data processing tools. To facilitate the design, validation, and comparison of fault-tolerant models and algorithms, we have created the Failure Trace Archive (FTA), an online, public repository of failure traces collected from diverse parallel and distributed systems. In this work, we first describe the design of the archive, in particular the standard FTA data format, and the design of a toolbox that facilitates automated analysis of trace data sets. We also discuss current and future uses of the FTA. Second, after applying the toolbox to nine failure traces collected from distributed systems used in different application domains (e.g., HPC, Internet operation, and several online applications), we present a comparative analysis of failures across these systems. The analysis provides statistical insights and typical statistical modeling results for the availability of individual resources. Its results underline the need for publicly available trace data from different distributed systems. Last, we show how different interpretations of the meaning of failure data can lead to different conclusions for failure modeling and job scheduling in distributed systems. Our results for these interpretations suggest that existing failure-aware algorithms may need to be revisited when they are applied to general rather than domain-specific distributed systems.