Fault-tolerant middleware and the magical 1%

Authors:
Tudor Dumitraş;Priya Narasimhan
Affiliations:
Carnegie Mellon University, Pittsburgh, PA;Carnegie Mellon University, Pittsburgh, PA
Venue:
Proceedings of the ACM/IFIP/USENIX 2005 International Conference on Middleware
Year:
2005

Abstract

Through an extensive experimental analysis of over 900 possible configurations of a fault-tolerant middleware system, we present empirical evidence that the unpredictability inherent in such systems arises from merely 1% of the remote invocations. The occurrence of very high latencies cannot be regulated through parameters such as the number of clients, the replication style and degree, or the request rate. However, by selectively filtering out a "magical 1%" of the raw observations of various metrics, we show that performance, measured as end-to-end latency and throughput, can be bounded and made easy to understand and control. This simple statistical technique enables us to guarantee, with some level of confidence, bounds for percentile-based quality of service (QoS) metrics, which dramatically increases our ability to tune and control a middleware system in a predictable manner.
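The filtering idea in the abstract can be illustrated with a minimal sketch. This is not the authors' code or data: the function names (`percentile_bound`, `trim_above_percentile`), the nearest-rank percentile rule, and the synthetic latency values are all assumptions chosen for illustration. The sketch shows how discarding the largest 1% of raw latency observations turns an unbounded-looking maximum into a tight 99th-percentile bound.

```python
import math
import random


def percentile_bound(samples, percent=99):
    """Nearest-rank percentile: the smallest observation that at least
    `percent` percent of the samples do not exceed."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(len(ordered) * percent / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def trim_above_percentile(samples, percent=99):
    """Keep only the observations at or below the percentile bound."""
    bound = percentile_bound(samples, percent)
    return [s for s in samples if s <= bound]


def synthetic_latencies(n=10_000, spike_every=100, seed=0):
    """Hypothetical workload (an assumption, not measured data): most
    invocations take about 1 ms, and one in `spike_every` is a large
    outlier of 10 ms to 100 ms. Values are in microseconds."""
    rng = random.Random(seed)
    latencies = []
    for i in range(n):
        if i % spike_every == 0:
            latencies.append(rng.uniform(10_000, 100_000))  # rare spike
        else:
            latencies.append(rng.uniform(900, 1_100))  # ordinary request
    return latencies


raw = synthetic_latencies()
trimmed = trim_above_percentile(raw, percent=99)

print("raw max (us):     ", round(max(raw)))
print("99th pct (us):    ", round(percentile_bound(raw, 99)))
print("trimmed max (us): ", round(max(trimmed)))
print("samples kept:     ", len(trimmed), "of", len(raw))
```

In this synthetic setup the raw maximum is dominated by the rare spikes, while the trimmed maximum stays within the ordinary latency range. A percentile-based QoS target (for example, "99% of invocations complete within some bound") is stated against the trimmed distribution rather than the raw maximum.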