Faults in a cluster are inevitable. The larger the cluster, the more likely it is that some failure will occur in hardware, in software, or through human error. System-S software must detect and self-repair failures while carrying out its prime directive: enabling stream processing program fragments to be distributed and connected to form complex applications. Depending on the type of failure, System-S may be able to continue with little or no disruption to potentially tens of thousands of interdependent and heterogeneous program fragments running across thousands of nodes. We extend our previously presented work on the self-healing nature of the job manager component in System-S by presenting how it can handle failures of other system components, applications, and network infrastructure. We also evaluate the recoverability of the job management orchestrator component of System-S, considering crash failures with and without error propagation.