Currently deployed grids gather thousands of computational and storage resources for the benefit of a large community of scientists. However, the large scale, the wide geographical spread, and at times the decision of the rightful resource owners to commit the capacity elsewhere raise serious resource availability issues. Little is known about the characteristics of grid resource availability or about the impact of resource unavailability on the performance of grids. In this work, we take first steps toward addressing this twofold lack of information. First, we analyze a long-term availability trace and assess the resource availability characteristics of Grid’5000, an experimental grid environment of over 2,500 processors. The average utilization for the studied trace increases by almost 5% when availability is considered. Based on the results of the analysis, we further propose a model for grid resource availability. Our analysis and modeling results show that grid computational resources become unavailable at a high rate, negatively affecting the ability of grids to execute long jobs. Second, through trace-based simulation, we show evidence that resource availability can have a severe impact on the performance of grid systems. The results of this step show that the performance of a grid system can rise when availability is taken into consideration, and that human administration of availability change information results in 10–15 times more job failures than an automated monitoring solution, even for a lightly utilized system.