Faults in Grids: Why are they so bad and What can be done about it?

  • Authors:
  • Raissa Medeiros;Walfredo Cirne;Francisco Brasileiro;Jacques Sauvé

  • Venue:
  • GRID '03 Proceedings of the 4th International Workshop on Grid Computing
  • Year:
  • 2003

Abstract

Computational Grids have the potential to become the main execution platform for high performance and distributed applications. However, such systems are extremely complex and prone to failures. In this paper, we present a survey with the grid community in which several people shared their actual experience regarding fault treatment. The survey reveals that, nowadays, users have to be highly involved in diagnosing failures, that most failures are due to configuration problems (a hint of the area's immaturity), and that solutions for dealing with failures are mainly application-dependent. Going further, we identify two main reasons for this state of affairs. First, grid components that provide high-level abstractions when working do expose all gory details when broken. Since there are no appropriate mechanisms to deal with the complexity exposed (configuration, middleware, hardware and software issues), users need to be deeply involved in the diagnosis and correction of failures. To address this problem, one needs a way to coordinate different support teams working at the grid's different levels of abstraction. Second, fault tolerance schemes today implemented on grids tolerate only crash failures. Since grids are prone to more complex failures, such as those caused by heisenbugs, one needs to tolerate tougher failures. Our hope is that the very heterogeneity that makes a grid a complex environment can help in the creation of diverse software replicas, a strategy that can tolerate more complex failures.