Reliable computer systems (3rd ed.): design and evaluation
Reliable computer systems (3rd ed.): design and evaluation
Improving cluster availability using workstation validation
SIGMETRICS '02 Proceedings of the 2002 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Markov Modeling for Reliability Analysis
Markov Modeling for Reliability Analysis
Grid Computing: Making the Global Infrastructure a Reality
Grid Computing: Making the Global Infrastructure a Reality
A large-scale study of failures in high-performance computing systems
DSN '06 Proceedings of the International Conference on Dependable Systems and Networks
Introducing Risk Management into the Grid
E-SCIENCE '06 Proceedings of the Second IEEE International Conference on e-Science and Grid Computing
Optimal task partition and distribution in grid service system with common cause failures
Future Generation Computer Systems - Special section: Information engineering and enterprise architecture in distributed computing environments
Failure trends in a large disk drive population
FAST '07 Proceedings of the 5th USENIX conference on File and Storage Technologies
CCGRID '08 Proceedings of the 2008 Eighth IEEE International Symposium on Cluster Computing and the Grid
On the dynamic resource availability in grids
GRID '07 Proceedings of the 8th IEEE/ACM International Conference on Grid Computing
An analysis of clustered failures on large supercomputing systems
Journal of Parallel and Distributed Computing
Modeling machine availability in enterprise and wide-area distributed computing environments
Euro-Par'05 Proceedings of the 11th international Euro-Par conference on Parallel Processing
OPTIMIS: A holistic approach to cloud service provisioning
Future Generation Computer Systems
Hi-index | 0.00 |
Service providers offer access to resources and services in distributed environments such as Grids and Clouds through formal Service level Agreements (SLA), and need well-balanced infrastructures so that they can maximise the Quality of Service (QoS) they offer and minimise the number of SLA violations. We propose a mathematical model to predict the risk of failure of resources in such environments using a discrete-time analytical model driven by reliability functions fitted to observed data. The model relies on the resource historical data so as to predict the risk of failure for a given time interval. The model is evaluated by comparing the predicted risk of failure with the observed risk of failure, and is shown to accurately predict the resources risk of failure, allowing a service provider to selectively choose which SLA request to accept.