Supporting fault-tolerance in heterogeneous distributed applications

Authors:
P. Maheshwari;J. Ouyang
Affiliations:
-;-
Venue:
HCW '97 Proceedings of the 6th Heterogeneous Computing Workshop (HCW '97)
Year:
1997

Abstract

Heterogeneous computing opens up new challenges and opportunities in fields such as parallel and distributed processing, algorithm design for applications, scheduling of parallel tasks, interconnection network technology, and support for reliable distributed heterogeneous computing. One trend in supporting fault-tolerance in distributed computing systems is to incorporate it into applications at low cost, in terms of both run-time performance and the programming effort required to construct reliable application software. We present an approach for developing efficient, reliable distributed applications for heterogeneous computing systems. We propose a library prototype, called H-Libra, that supports fault-tolerance in heterogeneous systems with low run-time cost. Fault-tolerance is based on distributed consistent checkpointing and rollback-recovery, integrated with a user-level network communication protocol. By employing novel mechanisms, H-Libra keeps communication overhead to a minimum when taking a consistent distributed checkpoint and catching messages in transit during a checkpoint. By providing fault-tolerance transparency and a simple, easy-to-use, high-level message-passing interface, H-Libra simplifies the development of reliable heterogeneous distributed applications.
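
The abstract does not describe H-Libra's checkpointing mechanism or its interface, so the sketch below should not be read as the authors' method. It illustrates the general idea of a consistent distributed checkpoint that catches messages in transit, using the classic marker-based snapshot scheme over FIFO channels, followed by a simple rollback. It is a single-threaded toy simulation; the names `Process`, `Network`, `checkpoint`, `deliver`, and `rollback` are hypothetical and chosen only for illustration.

```python
from collections import deque

MARKER = object()  # control message separating pre- and post-checkpoint traffic


class Process:
    def __init__(self, pid, balance):
        self.pid = pid
        self.balance = balance      # application state (assumed: one integer)
        self.saved_balance = None   # local checkpoint; None until taken
        self.in_transit = {}        # sender pid -> messages caught on that channel
        self.recording = set()      # senders whose marker has not arrived yet


class Network:
    def __init__(self, balances):
        self.procs = {pid: Process(pid, b) for pid, b in balances.items()}
        # One FIFO channel per ordered pair (src, dst).
        self.chan = {(s, d): deque() for s in balances for d in balances if s != d}

    def send(self, src, dst, amount):
        self.procs[src].balance -= amount
        self.chan[(src, dst)].append(amount)

    def checkpoint(self, pid):
        """Save local state, start recording all inputs, emit markers."""
        p = self.procs[pid]
        p.saved_balance = p.balance
        p.recording = {s for s in self.procs if s != pid}
        p.in_transit = {s: [] for s in p.recording}
        for d in self.procs:
            if d != pid:
                self.chan[(pid, d)].append(MARKER)

    def deliver(self, src, dst):
        """Deliver the oldest message on channel (src, dst)."""
        msg = self.chan[(src, dst)].popleft()
        p = self.procs[dst]
        if msg is MARKER:
            if p.saved_balance is None:
                self.checkpoint(dst)    # first marker triggers the checkpoint
            p.recording.discard(src)    # channel from src is now fully recorded
        else:
            if src in p.recording:
                p.in_transit[src].append(msg)  # sent before, received after
            p.balance += msg

    def rollback(self):
        """Restore saved states and replay the caught in-transit messages."""
        for q in self.chan.values():
            q.clear()
        for p in self.procs.values():
            p.balance = p.saved_balance
            for src, msgs in p.in_transit.items():
                self.chan[(src, p.pid)].extend(msgs)


def checkpoint_total(net):
    """Sum of saved states plus messages recorded as in transit."""
    saved = sum(p.saved_balance for p in net.procs.values())
    caught = sum(sum(m) for p in net.procs.values() for m in p.in_transit.values())
    return saved + caught


net = Network({"A": 100, "B": 50, "C": 25})
net.send("A", "B", 10)     # still in flight when the checkpoint starts
net.checkpoint("B")        # B initiates the checkpoint
net.deliver("A", "B")      # B catches the in-transit message from A
while any(net.chan.values()):          # drain the remaining markers
    for (s, d), q in net.chan.items():
        if q:
            net.deliver(s, d)
total_at_checkpoint = checkpoint_total(net)
net.rollback()
```

In this run the saved states alone do not account for the 10 units that A had already sent, so the checkpoint is consistent only because B records that message as in transit. After `rollback()`, the message is placed back on the channel from A to B so that it can be delivered again.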