Abstract: Ideally, a multicomputer system should cope with a processor failure by reconstructing itself, and the application running on it, so as to maintain the available computational power of the remaining processors. We discuss how running applications can continue through permanent processor failures. We take advantage of the characteristics of the actor model of parallel computation to checkpoint the activity of the application dynamically. Consequently, the runtime system is able to continue an application through multiple nonconcurrent processor failures. We have implemented our techniques by modifying the runtime system of the parallel language Charm on an Intel iPSC/2 hypercube. After discussing the theory and implementation, we give measurements of the overhead due to fault tolerance for a number of applications and demonstrate continuance of the applications after the injection of one or more faults.