Self healing in System-S

Authors:
Gabriela Jacques-Silva;Jim Challenger;Lou Degenaro;James Giles;Rohit Wagle
Affiliations:
Center for Reliable and High-Performance Computing, University of Illinois at Urbana Champaign, Urbana, USA 61820;IBM T.J. Watson Research Center, IBM Research, Hawthorne, USA 10532;IBM T.J. Watson Research Center, IBM Research, Hawthorne, USA 10532;IBM T.J. Watson Research Center, IBM Research, Hawthorne, USA 10532;IBM T.J. Watson Research Center, IBM Research, Hawthorne, USA 10532
Venue:
Cluster Computing
Year:
2008

Abstract

Faults in a cluster are inevitable. The larger the cluster, the more likely it is that some failure will occur, whether in hardware, in software, or through human error. System-S software must detect and self-repair failures while carrying out its prime directive: enabling stream processing program fragments to be distributed and connected to form complex applications. Depending on the type of failure, System-S may be able to continue with little or no disruption to potentially tens of thousands of interdependent and heterogeneous program fragments running across thousands of nodes. We extend our previously presented work on the self-healing nature of the job manager component in System-S by showing how it can handle failures of other system components, applications, and network infrastructure. We also evaluate the recoverability of the job management orchestrator component of System-S, considering crash failures with and without error propagation.