Faults in Grids: Why are they so bad and What can be done about it?

Authors:
Raissa Medeiros;Walfredo Cirne;Francisco Brasileiro;Jacques Sauvé
Venue:
GRID '03 Proceedings of the 4th International Workshop on Grid Computing
Year:
2003

Abstract

Computational Grids have the potential to become the main execution platform for high performance and distributed applications. However, such systems are extremely complex and prone to failures. In this paper, we present a survey with the grid community in which several people shared their actual experience regarding fault treatment. The survey reveals that, nowadays, users have to be highly involved in diagnosing failures, that most failures are due to configuration problems (a hint of the area's immaturity), and that solutions for dealing with failures are mainly application-dependent. Going further, we identify two main reasons for this state of affairs. First, grid components that provide high-level abstractions when working expose all the gory details when broken. Since there are no appropriate mechanisms to deal with the complexity exposed (configuration, middleware, hardware and software issues), users need to be deeply involved in the diagnosis and correction of failures. To address this problem, one needs a way to coordinate different support teams working at the grid's different levels of abstraction. Second, fault tolerance schemes implemented on grids today tolerate only crash failures. Since grids are prone to more complex failures, such as those caused by heisenbugs, one needs to tolerate tougher failures. Our hope is that the very heterogeneity that makes a grid a complex environment can help in the creation of diverse software replicas, a strategy that can tolerate more complex failures.