Living with nondeterminism in replicated middleware applications
Proceedings of the ACM/IFIP/USENIX 2006 International Conference on Middleware
When distributed applications are replicated for fault tolerance, the presence of even a single nondeterministic service can lead to emergent system-wide nondeterminism that compromises replica consistency. Our approach, Midas, identifies and addresses multiple sources of nondeterminism (such as system calls and multithreading) in a multi-service replicated distributed architecture. Midas combines compile-time dependency, concurrency, and nondeterminism analyses with performance-sensitive compensation of nondeterminism at runtime. This approach preserves existing application semantics and allows services to remain nondeterministic while still keeping their replicas consistent. We demonstrate Midas' scalability through a microbenchmark that shows the underlying tradeoffs under different kinds of dependencies between clients, services, and invocations in a distributed system. We also validate our claims by modeling a representative multi-service application using Java Pathfinder.
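To make the consistency problem concrete, the following is a minimal sketch, not Midas itself. It assumes a toy service whose state depends on a locally drawn random value, which stands in for a system-call result or thread schedule that differs per host. The names `Replica`, `handle`, `forced_value`, `run_uncompensated`, and `run_compensated` are hypothetical. The compensation shown is a generic record-and-forward scheme, in which one replica's nondeterministic value is reused by the other. It is chosen only for illustration and should not be read as the mechanism the abstract describes.

```python
import random


class Replica:
    """Toy replicated service whose state depends on a nondeterministic value."""

    def __init__(self, seed):
        # Each replica has its own RNG, standing in for per-host sources of
        # nondeterminism (clocks, system calls, thread interleavings).
        self._rng = random.Random(seed)
        self.state = []

    def handle(self, request, forced_value=None):
        # Reuse a forwarded value if one is supplied; otherwise draw locally.
        value = self._rng.random() if forced_value is None else forced_value
        self.state.append((request, value))
        return value


def run_uncompensated(requests):
    """Both replicas draw their own values; returns True if states match."""
    a, b = Replica(seed=1), Replica(seed=2)
    for request in requests:
        a.handle(request)
        b.handle(request)
    return a.state == b.state


def run_compensated(requests):
    """Replica a's value is forwarded to b (hypothetical compensation step)."""
    a, b = Replica(seed=1), Replica(seed=2)
    for request in requests:
        value = a.handle(request)
        b.handle(request, forced_value=value)
    return a.state == b.state
```

In this sketch the uncompensated run leaves the two replicas with different states, while the compensated run keeps them identical at the cost of one extra value transferred per request. That extra transfer is a simple instance of the kind of runtime overhead a performance-sensitive scheme would aim to minimize.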