Modeling machine availability in enterprise and wide-area distributed computing environments

Authors:
Daniel Nurmi;John Brevik;Rich Wolski
Affiliations:
Department of Computer Science, University of California, Santa Barbara, Santa Barbara, CA;Department of Computer Science, University of California, Santa Barbara, Santa Barbara, CA;Department of Computer Science, University of California, Santa Barbara, Santa Barbara, CA
Venue:
Euro-Par'05 Proceedings of the 11th international Euro-Par conference on Parallel Processing
Year:
2005

Abstract

In this paper, we consider the problem of modeling machine availability in enterprise-area and wide-area distributed computing settings. Using availability data gathered from three different environments, we detail the suitability of four potential statistical distributions for each data set: exponential, Pareto, Weibull, and hyperexponential. In each case, we use software we have developed to determine the necessary parameters automatically from each data collection. To gauge suitability, we present both graphical and statistical evaluations of the accuracy with which each distribution fits each data set. For all three data sets, we find that a hyperexponential model fits slightly more accurately than a Weibull, but that both are substantially better choices than either an exponential or a Pareto. These results indicate that either a hyperexponential or a Weibull model effectively represents machine availability in enterprise and Internet computing environments.
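
As a rough illustration of the kind of comparison the abstract describes, the following Python sketch fits three of the four candidate distributions by maximum likelihood and ranks them by Kolmogorov-Smirnov distance. It is not the authors' software. The data are synthetic (drawn from a Weibull with assumed shape 0.6 and scale 1000), the function names are hypothetical, and the choice of the Kolmogorov-Smirnov distance is an assumption, since the abstract does not name the statistical evaluation used. A hyperexponential fit is left out, since it has no closed-form estimator and typically needs an iterative procedure such as expectation-maximization.

```python
import math
import random


def fit_exponential(xs):
    """MLE for the exponential rate: 1 / sample mean."""
    return len(xs) / sum(xs)


def fit_pareto(xs):
    """MLE for a Pareto with x_min fixed at the sample minimum."""
    x_min = min(xs)
    alpha = len(xs) / sum(math.log(x / x_min) for x in xs)
    return x_min, alpha


def fit_weibull(xs, lo=0.01, hi=50.0, iters=100):
    """MLE for Weibull (shape, scale); the shape is found by bisection."""
    top = max(xs)
    zs = [x / top for x in xs]  # rescale to (0, 1] so powers cannot overflow
    mean_log = sum(math.log(z) for z in zs) / len(zs)

    def score(k):
        # Likelihood equation for the shape k; it is increasing in k
        # and its root is the maximum-likelihood shape.
        num = sum(z ** k * math.log(z) for z in zs)
        den = sum(z ** k for z in zs)
        return num / den - 1.0 / k - mean_log

    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if score(mid) > 0:
            hi = mid
        else:
            lo = mid
    shape = (lo + hi) / 2.0
    scale = top * (sum(z ** shape for z in zs) / len(zs)) ** (1.0 / shape)
    return shape, scale


def ks_distance(xs, cdf):
    """Largest gap between the empirical CDF and a fitted CDF."""
    xs = sorted(xs)
    n = len(xs)
    return max(
        max(cdf(x) - i / n, (i + 1) / n - cdf(x))
        for i, x in enumerate(xs)
    )


random.seed(0)
# Synthetic, strictly positive "availability durations"; NOT data from the paper.
durations = [random.weibullvariate(1000.0, 0.6) for _ in range(2000)]

rate = fit_exponential(durations)
x_min, alpha = fit_pareto(durations)
shape, scale = fit_weibull(durations)

ks = {
    "exponential": ks_distance(durations, lambda x: 1 - math.exp(-rate * x)),
    "pareto": ks_distance(durations, lambda x: 1 - (x_min / x) ** alpha),
    "weibull": ks_distance(
        durations, lambda x: 1 - math.exp(-((x / scale) ** shape))
    ),
}

# Smaller distance means a closer fit to the empirical distribution.
for name, dist in sorted(ks.items(), key=lambda kv: kv[1]):
    print(f"{name:12s} KS distance = {dist:.3f}")
```

Because the synthetic sample is itself Weibull, the Weibull fit ranking first here says nothing about real availability traces; the sketch only shows the mechanics of automatic parameter estimation followed by a goodness-of-fit comparison.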