Got predictability?: experiences with fault-tolerant middleware

Authors:
Tudor Dumitraş;Priya Narasimhan
Affiliations:
Carnegie Mellon University, Pittsburgh PA;Carnegie Mellon University, Pittsburgh PA
Venue:
Proceedings of the 2007 ACM/IFIP/USENIX international conference on Middleware companion
Year:
2007

Abstract

Unpredictability in COTS-based systems often manifests as occasional instances of uncontrollably high response times. A particular category of COTS systems, fault-tolerant (FT) middleware, is used in critical enterprise and embedded applications where predictability is of paramount importance. Our prior empirical study, which used a client-server microbenchmark, suggested that hard bounds on the maximum latency are difficult to establish a priori, but that the unpredictability may be confined to less than 1% of the requests. In this paper, we present empirical data from 7 different three-tier FT-middleware applications that provide strong evidence for this "magical 1%" hypothesis. We conducted a controlled experiment with 7 teams of students from a graduate-level course at Carnegie Mellon University. Starting from a common three-tier architecture, each team independently implemented and evaluated an original application using middleware (either CORBA or EJB) and a custom-implemented fault-tolerance mechanism (relying on either state-machine or primary-backup replication) for the middle-tier server. This experiment shows that unpredictability may not be avoidable, even in the absence of faults, and that, in some cases, the random latency outliers are larger than the time needed to recover from a fault. The data also yield a statistically significant result: across all 7 applications, unpredictability is confined to the highest 1% of the recorded end-to-end latencies and is not correlated with the request rate, the size of the messages exchanged, or the number of clients. This suggests that strict predictability is hard to achieve in FT-middleware systems and that developers of critical FT applications should focus on guaranteeing bounds for statistical measures, such as the 99th percentile of the latency.
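As a rough illustration of what a bound on a statistical measure can look like in practice, the following Python sketch computes the 99th percentile of a set of end-to-end latencies and separates the top 1% tail from the rest. It is only a sketch under stated assumptions: the function names (`percentile`, `split_at_percentile`), the nearest-rank percentile definition, and the synthetic latency values are all hypothetical and are not taken from the paper or its experiments.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must be non-empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the interval (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def split_at_percentile(samples, p=99):
    """Return (threshold, body, tail), where tail holds the samples
    strictly above the p-th percentile."""
    threshold = percentile(samples, p)
    body = [x for x in samples if x <= threshold]
    tail = [x for x in samples if x > threshold]
    return threshold, body, tail


# Synthetic latencies in milliseconds, invented for illustration only:
# 990 "ordinary" requests plus 10 large outliers.
latencies_ms = [1.0 + (i % 10) * 0.1 for i in range(990)]
latencies_ms += [50.0 + 25.0 * i for i in range(10)]

threshold, body, tail = split_at_percentile(latencies_ms, 99)
tail_fraction = len(tail) / len(latencies_ms)

print(f"99th percentile: {threshold:.2f} ms")
print(f"maximum latency: {max(latencies_ms):.2f} ms")
print(f"requests above the 99th percentile: {len(tail)} ({tail_fraction:.1%})")
```

In this made-up data set the maximum latency is far larger than the 99th percentile, which is the situation the abstract describes: a bound on the 99th percentile can be meaningful even when the maximum is not controllable. The nearest-rank method is just one of several common percentile definitions; an interpolating definition would give slightly different thresholds.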