Computational Grids have the potential to become the main execution platform for high performance and distributed applications. However, such systems are extremely complex and prone to failures. In this paper, we present a survey of the grid community in which several people shared their actual experience with fault treatment. The survey reveals that, nowadays, users have to be highly involved in diagnosing failures, that most failures are due to configuration problems (a hint of the area's immaturity), and that solutions for dealing with failures are mainly application-dependent. Going further, we identify two main reasons for this state of affairs. First, grid components that provide high-level abstractions when working expose all the gory details when broken. Since there are no appropriate mechanisms to deal with the complexity exposed (configuration, middleware, hardware and software issues), users need to be deeply involved in the diagnosis and correction of failures. To address this problem, one needs a way to coordinate the different support teams working at the grid's different levels of abstraction. Second, the fault tolerance schemes implemented on grids today tolerate only crash failures. Since grids are prone to more complex failures, such as those caused by heisenbugs, one needs to tolerate tougher failures. Our hope is that the very heterogeneity that makes a grid a complex environment can help in the creation of diverse software replicas, a strategy that can tolerate more complex failures.