Fault-Tolerant Parallel Applications Using Queues and Actions

Authors:
J. A. Smith; Santosh K. Shrivastava
Venue:
ICPP '97 Proceedings of the International Conference on Parallel Processing
Year:
1997

Abstract

Many techniques support the execution of large computations over a network of workstations (NOW), but data-intensive computations are usually run on high-performance parallel machines. A NOW comprising individual users' machines typically has a low-performance interconnect and suffers arbitrary changes of availability. Exploiting such resources to execute data-intensive computations is difficult, but even in a more constrained environment there is an unfulfilled need for fault tolerance. The structuring approach presented fulfills this need. Performance exceeding 100 Mflop/s is demonstrated for large, fault-tolerant, out-of-core examples of matrix multiplication and Cholesky factorisation using five 133 MHz Pentium compute machines.
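
The abstract does not spell out the structuring approach, so the following is only an illustrative sketch of the general idea the title suggests: work items are held in a queue, and a worker's dequeue takes effect only if its action commits. Everything here is an assumption made for illustration and not the authors' implementation: the names `RecoverableQueue`, `multiply_block`, and `run` are hypothetical, the queue lives in memory rather than on stable storage, and a worker failure is simulated by aborting a numbered attempt.

```python
from collections import deque


class RecoverableQueue:
    """Toy in-memory queue; a dequeue is tentative until the action commits."""

    def __init__(self, items):
        self._items = deque(items)

    def begin_dequeue(self):
        # Tentatively remove the head; the caller must commit or abort.
        return self._items.popleft() if self._items else None

    def abort(self, item):
        # Undo the tentative dequeue so the task can be retried.
        self._items.appendleft(item)

    def __len__(self):
        return len(self._items)


def multiply_block(a, b, rows, cols):
    """Compute the entries of a @ b for the given row and column ranges."""
    inner = len(b)
    return {
        (i, j): sum(a[i][k] * b[k][j] for k in range(inner))
        for i in rows
        for j in cols
    }


def run(a, b, block, fail_on=()):
    """Multiply a by b block by block; attempts listed in fail_on 'crash'."""
    n, m = len(a), len(b[0])
    tasks = [
        (range(i, min(i + block, n)), range(j, min(j + block, m)))
        for i in range(0, n, block)
        for j in range(0, m, block)
    ]
    queue = RecoverableQueue(tasks)
    result = {}
    attempt = 0
    while len(queue):
        task = queue.begin_dequeue()
        attempt += 1
        partial = multiply_block(a, b, *task)
        if attempt in fail_on:
            # Simulated worker failure: drop partial work, restore the task.
            queue.abort(task)
            continue
        # Commit: the dequeue stands and the block result becomes visible.
        result.update(partial)
    c = [[result[(i, j)] for j in range(m)] for i in range(n)]
    return c, attempt
```

Calling `run([[1, 2], [3, 4]], [[5, 6], [7, 8]], block=1, fail_on=(2,))` still yields the full product even though the second attempt is aborted, because the aborted task goes back on the queue and is retried. A real system would need the queue and the results to survive machine failures, which this sketch does not attempt.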