Heterogeneous computing opens up new challenges and opportunities in fields such as parallel and distributed processing, algorithm design for applications, scheduling of parallel tasks, interconnection network technology, and support for reliable distributed heterogeneous computing. A current trend in distributed computing systems is to incorporate fault tolerance into applications at low cost, in terms of both run-time performance and the programming effort required to construct reliable application software. We present an approach for developing efficient, reliable distributed applications for heterogeneous computing systems. We propose a library prototype, called H-Libra, that supports fault tolerance in heterogeneous systems with low run-time cost. Fault tolerance is based on distributed consistent checkpointing and rollback recovery, integrated with a user-level network communication protocol. Novel mechanisms keep the communication overhead of taking a consistent distributed checkpoint, and of catching messages in transit during a checkpoint, to a minimum. By providing fault-tolerance transparency and a simple, easy-to-use, high-level message-passing interface, H-Libra simplifies the development of reliable heterogeneous distributed applications.
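To make the idea of a consistent distributed checkpoint that also captures in-transit messages concrete, the following is a minimal single-process simulation. It is a sketch only: it uses the classic marker-based snapshot algorithm of Chandy and Lamport over reliable FIFO channels, not H-Libra's own mechanism, which the abstract does not describe. The names `Process`, `Network`, `snapshot_total`, and `run_demo`, and the money-transfer workload, are hypothetical and chosen purely for illustration.

```python
from collections import deque

MARKER = "MARKER"  # control message that delimits a checkpoint on a channel


class Network:
    """Reliable FIFO channels, one queue per (src, dst) pair (an assumption)."""

    def __init__(self):
        self.channels = {}

    def send(self, src, dst, msg):
        self.channels.setdefault((src, dst), deque()).append(msg)

    def drain(self, procs):
        # Deliver messages in FIFO order until every channel is empty.
        progressed = True
        while progressed:
            progressed = False
            for (src, dst), queue in list(self.channels.items()):
                if queue:
                    procs[dst].receive(src, queue.popleft(), self)
                    progressed = True


class Process:
    def __init__(self, pid, peers, balance):
        self.pid = pid
        self.peers = peers
        self.balance = balance    # application state
        self.checkpoint = None    # saved copy of the application state
        self.recording = {}       # src -> messages seen since the checkpoint
        self.channel_state = {}   # src -> in-transit messages, once finalized

    def transfer(self, dst, amount, net):
        self.balance -= amount
        net.send(self.pid, dst, amount)

    def start_checkpoint(self, net):
        # Save local state, start recording every incoming channel,
        # and send a marker on every outgoing channel.
        self.checkpoint = self.balance
        self.recording = {p: [] for p in self.peers}
        for p in self.peers:
            net.send(self.pid, p, MARKER)

    def receive(self, src, msg, net):
        if msg == MARKER:
            if self.checkpoint is None:
                self.start_checkpoint(net)
            # Messages recorded so far on this channel were in transit
            # when the checkpoint was taken.
            self.channel_state[src] = self.recording.pop(src)
        else:
            self.balance += msg
            if src in self.recording:
                self.recording[src].append(msg)


def snapshot_total(procs):
    """Sum of checkpointed balances plus all recorded in-transit amounts."""
    in_transit = sum(sum(msgs) for p in procs.values()
                     for msgs in p.channel_state.values())
    return sum(p.checkpoint for p in procs.values()) + in_transit


def run_demo():
    net = Network()
    procs = {0: Process(0, [1], 100), 1: Process(1, [0], 100)}
    procs[0].transfer(1, 30, net)   # still in transit when P1 checkpoints
    procs[1].transfer(0, 10, net)   # sent before P1's marker
    procs[1].start_checkpoint(net)
    net.drain(procs)
    return procs


if __name__ == "__main__":
    print(snapshot_total(run_demo()))
```

In this run the transfer of 30 is sent before process 0 checkpoints but arrives after process 1 has checkpointed, so neither local checkpoint accounts for it; recording it as channel state is what keeps the global checkpoint consistent. A rollback-recovery scheme would restore each process from its checkpoint and replay the recorded messages.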