A System for Fault-Tolerant Execution of Data and Compute Intensive Programs Over a Network of Workstations
Many techniques support the execution of large computations over a network of workstations (NOW), but data-intensive computations are usually run on high-performance parallel machines. A NOW comprising individual users' machines typically has a low-performance interconnect and suffers arbitrary changes in availability. Exploiting such resources to execute data-intensive computations is difficult, but even in a more constrained environment there is an unfulfilled need for fault tolerance. The structuring approach presented fulfills this need. Performance exceeding 100 Mflop/s is demonstrated for large, fault-tolerant, out-of-core examples of matrix multiplication and Cholesky factorisation using five 133 MHz Pentium compute machines.