A major hurdle facing data-intensive grid applications is the appropriate handling of failures that occur in the grid environment. Implementing fault tolerance transparently at the grid-middleware level would make different data-intensive applications fault-tolerant without each having to pay a separate cost, and would reduce the time to a grid-based solution for many scientific problems. We analyzed the failures encountered by four real-life production data-intensive applications: the NCSA image processing pipeline, the WCER video processing pipeline, the US-CMS pipeline, and the BMRB BLAST pipeline. Taking the results of this analysis into account, we designed and implemented Phoenix, a transparent middleware-level fault-tolerance layer that detects failures early, classifies them as transient or permanent, and handles the transient failures appropriately. We applied our fault-tolerance layer to a prototype of the NCSA image processing pipeline and considerably improved its failure handling; we report on the insights gained in the process.
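The classify-then-handle idea described above can be illustrated with a minimal Python sketch. This is not Phoenix's implementation: the names `classify`, `run_with_fault_tolerance`, and `PermanentFailure`, the choice of which exception types count as transient, and the exponential backoff policy are all assumptions made for illustration. Early failure detection is not modeled here.

```python
import time
from enum import Enum


class FailureKind(Enum):
    TRANSIENT = "transient"
    PERMANENT = "permanent"


class PermanentFailure(Exception):
    """Raised when a task fails permanently or exhausts its retries."""


# Assumption for illustration: timeouts and connection errors are treated
# as transient. A real classifier would use richer evidence than the
# exception type alone.
TRANSIENT_ERRORS = (TimeoutError, ConnectionError)


def classify(exc):
    """Map an exception to a failure class (hypothetical policy)."""
    if isinstance(exc, TRANSIENT_ERRORS):
        return FailureKind.TRANSIENT
    return FailureKind.PERMANENT


def run_with_fault_tolerance(task, max_retries=3, base_delay=1.0,
                             sleep=time.sleep):
    """Run task(), retrying transient failures with exponential backoff.

    Permanent failures are surfaced immediately without retrying.
    """
    attempt = 0
    while True:
        try:
            return task()
        except Exception as exc:
            if classify(exc) is FailureKind.PERMANENT:
                raise PermanentFailure("permanent failure") from exc
            if attempt >= max_retries:
                raise PermanentFailure("retries exhausted") from exc
            # Wait longer after each consecutive transient failure.
            sleep(base_delay * 2 ** attempt)
            attempt += 1
```

Wrapping the task in this way keeps the retry policy outside the application code, which is the sense in which a middleware-level layer can be transparent to the applications it protects.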