Handling Emergent Nondeterminism in Replicated Services

Authors:
Joseph Slember;Priya Narasimhan
Affiliations:
Carnegie Mellon University, Pittsburgh, PA 15213, USA;Carnegie Mellon University, Pittsburgh, PA 15213, USA
Venue:
Architecting Dependable Systems V
Year:
2008

Abstract

When distributed applications are replicated for fault tolerance, the presence of even a single nondeterministic service can lead to emergent system-wide nondeterminism that compromises replica consistency. Our approach, Midas, identifies and addresses multiple sources of nondeterminism (including system calls and multithreading) in a multi-service replicated distributed architecture. Midas combines compile-time dependency, concurrency, and nondeterminism analyses with performance-sensitive compensation of nondeterminism at runtime. This approach preserves existing application semantics and allows services to remain nondeterministic while still keeping their replicas consistent. We demonstrate Midas' scalability through a microbenchmark that shows the underlying tradeoffs under different kinds of dependencies between clients, services, and invocations in a distributed system. We also validate our claims by modeling a representative multi-service application using Java Pathfinder.
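The problem the abstract describes, and the general idea of compensating for it at runtime, can be illustrated with a toy sketch. This is not Midas's algorithm, which the abstract does not spell out. It assumes a hypothetical `Replica` class whose state depends on a local clock (standing in for a nondeterministic system call), and it uses one simple compensation strategy: the second replica adopts the value the first replica observed.

```python
import itertools


class Replica:
    """Toy replicated service whose state depends on a nondeterministic call.

    Hypothetical illustration only; the names are not taken from Midas.
    """

    def __init__(self, clock):
        self.clock = clock  # stand-in for a local, nondeterministic system call
        self.log = []       # replica state: list of (request, timestamp) pairs

    def handle(self, request, forced_value=None):
        # Use the value supplied by the compensation step if there is one;
        # otherwise fall back to the local nondeterministic source.
        stamp = forced_value if forced_value is not None else next(self.clock)
        self.log.append((request, stamp))
        return stamp


def run_without_compensation(requests):
    a = Replica(itertools.count(100))
    b = Replica(itertools.count(250))  # a different "local clock"
    for r in requests:
        a.handle(r)
        b.handle(r)
    return a.log, b.log


def run_with_compensation(requests):
    a = Replica(itertools.count(100))
    b = Replica(itertools.count(250))
    for r in requests:
        stamp = a.handle(r)               # replica a makes the nondeterministic call
        b.handle(r, forced_value=stamp)   # replica b adopts a's observed result
    return a.log, b.log


if __name__ == "__main__":
    requests = ["deposit", "withdraw"]
    log_a, log_b = run_without_compensation(requests)
    print("uncompensated replicas consistent:", log_a == log_b)
    log_a, log_b = run_with_compensation(requests)
    print("compensated replicas consistent:", log_a == log_b)
```

In this sketch the service stays nondeterministic (it still reads its own clock when no value is forced), yet the replicas end in the same state. Each forced value costs extra coordination, which is the kind of tradeoff the abstract's microbenchmark is said to explore.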