Resource failures risk assessment modelling in distributed environments

Authors:
Raid Alsoghayer;Karim Djemame
Affiliations:
-;-
Venue:
Journal of Systems and Software
Year:
2014



Abstract

Service providers offer access to resources and services in distributed environments such as Grids and Clouds through formal Service Level Agreements (SLAs). They need well-balanced infrastructures so that they can maximise the Quality of Service (QoS) they offer and minimise the number of SLA violations. We propose a mathematical model to predict the risk of failure of resources in such environments: a discrete-time analytical model driven by reliability functions fitted to observed data. The model relies on a resource's historical data to predict its risk of failure for a given time interval. The model is evaluated by comparing the predicted risk of failure with the observed risk of failure, and is shown to accurately predict the resources' risk of failure, allowing a service provider to choose selectively which SLA requests to accept.
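As a rough illustration of what "risk of failure for a given time interval" can mean when it is derived from a reliability function, the sketch below computes the conditional probability that a resource fails during an interval, given that it has survived up to the start of that interval. The abstract does not say which reliability function the authors fitted, so the Weibull form used here is an assumption, and the names `weibull_reliability`, `risk_of_failure`, and `accept_sla`, the parameter values, and the acceptance threshold are all hypothetical.

```python
import math


def weibull_reliability(t, shape, scale):
    """R(t) = exp(-(t / scale) ** shape): probability of surviving past time t.

    The Weibull form is an assumption made for this sketch only.
    """
    return math.exp(-((t / scale) ** shape))


def risk_of_failure(age, duration, shape, scale):
    """Probability of failing in (age, age + duration], given survival to `age`.

    Computed as 1 - R(age + duration) / R(age).
    """
    r_start = weibull_reliability(age, shape, scale)
    r_end = weibull_reliability(age + duration, shape, scale)
    return 1.0 - r_end / r_start


def accept_sla(risk, threshold):
    """Hypothetical acceptance rule: accept only if the risk is within the threshold."""
    return risk <= threshold


# Example: a resource that has been up for 100 hours is asked to run a 24-hour job.
# The shape and scale values are made up; in practice they would be fitted to
# the resource's observed failure history.
risk = risk_of_failure(age=100.0, duration=24.0, shape=0.7, scale=500.0)
print(f"predicted risk of failure: {risk:.3f}")
print("accept request:", accept_sla(risk, threshold=0.10))
```

With a shape parameter of 1 the Weibull function reduces to the exponential case, where the risk for a fixed duration no longer depends on how long the resource has already been running. A shape below 1 models a failure rate that decreases with uptime.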