Phoenix: Making Data-Intensive Grid Applications Fault-Tolerant

Authors:
George Kola; Tevfik Kosar; Miron Livny
Affiliations:
University of Wisconsin-Madison; University of Wisconsin-Madison; University of Wisconsin-Madison
Venue:
GRID '04 Proceedings of the 5th IEEE/ACM International Workshop on Grid Computing
Year:
2004

Abstract

A major hurdle facing data-intensive grid applications is the appropriate handling of failures that occur in the grid environment. Implementing fault tolerance transparently at the grid-middleware level would make different data-intensive applications fault-tolerant without each having to pay a separate cost, and would reduce the time to a grid-based solution for many scientific problems. We analyzed the failures encountered by four real-life production data-intensive applications: the NCSA image processing pipeline, the WCER video processing pipeline, the US-CMS pipeline, and the BMRB BLAST pipeline. Taking the results of this analysis into account, we designed and implemented Phoenix, a transparent middleware-level fault-tolerance layer that detects failures early, classifies them as transient or permanent, and handles the transient failures appropriately. We applied our fault-tolerance layer to a prototype of the NCSA image processing pipeline, considerably improving its failure handling, and we report on the insights gained in the process.
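
The classify-then-handle idea in the abstract can be illustrated with a minimal sketch. This is not Phoenix's implementation: the names `FailureKind`, `TRANSIENT_ERRORS`, `classify`, and `run_with_retries` are hypothetical, and the rule that timeouts and connection errors count as transient is an assumption made only for illustration.

```python
import time
from enum import Enum


class FailureKind(Enum):
    TRANSIENT = "transient"
    PERMANENT = "permanent"


# Assumption: these exception types stand in for transient grid failures.
# A real policy would be derived from observed failure data.
TRANSIENT_ERRORS = (TimeoutError, ConnectionError)


def classify(exc):
    """Label an exception as transient or permanent (illustrative policy)."""
    if isinstance(exc, TRANSIENT_ERRORS):
        return FailureKind.TRANSIENT
    return FailureKind.PERMANENT


def run_with_retries(task, max_attempts=3, base_delay=0.0, sleep=time.sleep):
    """Run task(); retry transient failures, re-raise permanent ones at once."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception as exc:
            # Permanent failures, or an exhausted retry budget, go to the caller.
            if classify(exc) is FailureKind.PERMANENT or attempt == max_attempts:
                raise
            # Exponential backoff before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
```

A caller would wrap a job or data transfer in a function and pass it to `run_with_retries`. Failures classified as permanent surface immediately instead of consuming retries, which is the reason for classifying before handling.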