On the dynamic resource availability in grids

Authors:
Alexandru Iosup;Mathieu Jan;Ozan Sonmez;Dick H. J. Epema
Affiliations:
Faculty of Electrical Engineering, Mathematics, and Computer Science, Delft University of Technology, The Netherlands;LRI/INRIA Futurs, University of Paris South, France;Faculty of Electrical Engineering, Mathematics, and Computer Science, Delft University of Technology, The Netherlands;Faculty of Electrical Engineering, Mathematics, and Computer Science, Delft University of Technology, The Netherlands
Venue:
GRID '07 Proceedings of the 8th IEEE/ACM International Conference on Grid Computing
Year:
2007

References (12)

The utility of exploiting idle workstations for parallel computation
  SIGMETRICS '97 Proceedings of the 1997 ACM SIGMETRICS international conference on Measurement and modeling of computer systems

On-line algorithms
  ACM Computing Surveys (CSUR)

Dependability Measurement and Modeling of a Multicomputer System
  IEEE Transactions on Computers

On-line Scheduling
  Developments from a June 1996 seminar on Online algorithms: the state of the art

A longitudinal survey of Internet host reliability
  SRDS '95 Proceedings of the 14th Symposium on Reliable Distributed Systems

Measurement, modeling, and analysis of a peer-to-peer file-sharing workload
  SOSP '03 Proceedings of the nineteenth ACM symposium on Operating systems principles

The NorduGrid production Grid infrastructure, status and plans
  GRID '03 Proceedings of the 4th International Workshop on Grid Computing

A large-scale study of failures in high-performance computing systems
  DSN '06 Proceedings of the International Conference on Dependable Systems and Networks

A batch scheduler with high level components
  CCGRID '05 Proceedings of the Fifth IEEE International Symposium on Cluster Computing and the Grid (CCGrid'05) - Volume 2

Grid'5000: A Large Scale And Highly Reconfigurable Experimental Grid Testbed
  International Journal of High Performance Computing Applications

Performance implications of failures in large-scale cluster scheduling
  JSSPP'04 Proceedings of the 10th international conference on Job Scheduling Strategies for Parallel Processing

Modeling machine availability in enterprise and wide-area distributed computing environments
  Euro-Par'05 Proceedings of the 11th international Euro-Par conference on Parallel Processing

Cited by (22)

The Grid Workloads Archive
  Future Generation Computer Systems

File grouping for scientific data management: lessons from experimenting with real traces
  HPDC '08 Proceedings of the 17th international symposium on High performance distributed computing

The performance of bags-of-tasks in large-scale distributed systems
  HPDC '08 Proceedings of the 17th international symposium on High performance distributed computing

DGSim: Comparing Grid Resource Management Architectures through Trace-Based Simulation
  Euro-Par '08 Proceedings of the 14th international Euro-Par conference on Parallel Processing

Combined Fault Tolerance and Scheduling Techniques for Workflow Applications on Computational Grids
  CCGRID '09 Proceedings of the 2009 9th IEEE/ACM International Symposium on Cluster Computing and the Grid

A performance study of grid workflow engines
  GRID '08 Proceedings of the 2008 9th IEEE/ACM International Conference on Grid Computing

Troubleshooting thousands of jobs on production grids using data mining techniques
  GRID '08 Proceedings of the 2008 9th IEEE/ACM International Conference on Grid Computing

Workload characterization in a high-energy data grid and impact on resource management
  Cluster Computing

Performance evaluation of an application-level checkpointing solution on grids
  Future Generation Computer Systems

The Failure Trace Archive: Enabling Comparative Analysis of Failures in Diverse Distributed Systems
  CCGRID '10 Proceedings of the 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing

A model for space-correlated failures in large-scale distributed systems
  EuroPar'10 Proceedings of the 16th international Euro-Par conference on Parallel processing: Part I

Scheduling scientific workflows to meet soft deadlines in the absence of failure models
  EuroPar'10 Proceedings of the 16th international Euro-Par conference on Parallel processing: Part I

Risk aware overbooking for commercial grids
  JSSPP'10 Proceedings of the 15th international conference on Job scheduling strategies for parallel processing

The importance of complete data sets for job scheduling simulations
  JSSPP'10 Proceedings of the 15th international conference on Job scheduling strategies for parallel processing

Business process scheduling with resource availability constraints
  OTM'10 Proceedings of the 2010 international conference on On the move to meaningful internet systems - Volume Part I

A Robust and Efficient Message Passing Library for Volunteer Computing Environments
  Journal of Grid Computing

WiGriMMA: A Wireless Grid Monitoring Model Using Agents
  Journal of Grid Computing

The XtreemOS Resource Selection Service
  ACM Transactions on Autonomous and Adaptive Systems (TAAS) - Special Section: Extended Version of SASO 2011 Best Paper

Probabilistic versus possibilistic risk assessment models for optimal service level agreements in grid computing
  Information Systems and e-Business Management

The Failure Trace Archive: Enabling the comparison of failure measurements and models of distributed systems
  Journal of Parallel and Distributed Computing

Autonomous massively multiplayer online game operation on unreliable resources
  Proceedings of the International C* Conference on Computer Science and Software Engineering

Resource failures risk assessment modelling in distributed environments
  Journal of Systems and Software

Abstract

Currently deployed grids gather together thousands of computational and storage resources for the benefit of a large community of scientists. However, the large scale, the wide geographical spread, and at times the decision of the rightful resource owners to commit the capacity elsewhere raise serious resource availability issues. Little is known about the characteristics of grid resource availability, or about the impact of resource unavailability on the performance of grids. In this work, we take first steps toward addressing this twofold lack of information. First, we analyze a long-term availability trace and assess the resource availability characteristics of Grid’5000, an experimental grid environment of over 2,500 processors. The average utilization for the studied trace increases by almost 5% when availability is considered. Based on the results of the analysis, we further propose a model for grid resource availability. Our analysis and modeling results show that grid computational resources become unavailable at a high rate, negatively affecting the ability of grids to execute long jobs. Second, through trace-based simulation, we show evidence that resource availability can have a severe impact on the performance of grid systems. The results of this step indicate that the performance of a grid system can rise when availability is taken into consideration, and that human administration of availability change information results in 10–15 times more job failures than an automated monitoring solution, even for a lightly utilized system.
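
One way to read the utilization claim is that utilization is a ratio of consumed to offered capacity, so subtracting unavailable capacity from the denominator raises the ratio. The sketch below illustrates that arithmetic only. It is not the authors' method, the function name `utilization` is an assumption, and every number in it is hypothetical rather than taken from the Grid'5000 trace.

```python
def utilization(busy_cpu_hours: float, capacity_cpu_hours: float) -> float:
    """Return the fraction of the offered capacity that was consumed."""
    if capacity_cpu_hours <= 0:
        raise ValueError("capacity must be positive")
    return busy_cpu_hours / capacity_cpu_hours


# All values below are hypothetical and chosen only for illustration.
processors = 2_500
hours = 24 * 30                                # a 30-day observation window
nominal_capacity = processors * hours          # assumes every processor is always up
unavailable_cpu_hours = 300_000                # hypothetical capacity lost to unavailability
available_capacity = nominal_capacity - unavailable_cpu_hours
busy_cpu_hours = 450_000                       # hypothetical work actually executed

# Utilization measured against the nominal capacity ignores unavailability.
naive = utilization(busy_cpu_hours, nominal_capacity)
# Utilization measured against the capacity that was really offered.
adjusted = utilization(busy_cpu_hours, available_capacity)

print(f"ignoring availability:    {naive:.2%}")
print(f"considering availability: {adjusted:.2%}")
```

Because the same amount of work is divided by a smaller capacity, the availability-aware figure is always at least as large as the naive one.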