The lack of fault tolerance is becoming a limiting factor for application scalability in HPC systems. MPI does not provide standardized fault tolerance interfaces or semantics. The MPI Forum's Fault Tolerance Working Group is proposing a collective fault-tolerant agreement algorithm for the next MPI standard. Such algorithms play a central role in many fault-tolerant applications. This paper combines a log-scaling two-phase commit agreement algorithm with a reduction operation to provide the necessary functionality for the new collective without any additional messages. Error handling mechanisms are described that preserve the fault tolerance properties while maintaining overall scalability.
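To make the idea of piggybacking a reduction on a two-phase commit concrete, here is a minimal failure-free simulation. It is a sketch and not the paper's algorithm. The binary tree layout, the bitwise AND reduction, and the name tree_agree are all assumptions chosen for illustration, and the error handling mentioned in the abstract is not modeled.

```python
def tree_agree(values):
    """Simulate a failure-free two-phase agreement over a binary tree.

    values[r] is the integer contribution of rank r (hypothetical input).
    Returns the per-rank decisions and the number of messages sent.
    """
    n = len(values)
    partial = list(values)
    messages = 0

    # Phase 1 (vote): walk ranks from the leaves toward the root. Each rank
    # sends one message to its parent carrying the bitwise AND of its own
    # subtree, so the reduction rides on the vote and adds no messages.
    for rank in range(n - 1, 0, -1):
        parent = (rank - 1) // 2
        partial[parent] &= partial[rank]
        messages += 1

    # Phase 2 (commit): the root's result travels back down the same tree.
    decided = [None] * n
    decided[0] = partial[0]
    for rank in range(1, n):
        parent = (rank - 1) // 2
        decided[rank] = decided[parent]
        messages += 1

    return decided, messages


if __name__ == "__main__":
    decisions, sent = tree_agree([0b1111, 0b1101, 0b0111, 0b1111, 0b1101])
    print(decisions, sent)
```

In this sketch every rank ends with the same reduced value after 2(n - 1) messages, and the longest chain of dependent messages grows with the tree depth, which is logarithmic in the number of ranks.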