Quantitative system performance: computer system analysis using queueing network models
Implementing fault-tolerant services using the state machine approach: a tutorial
ACM Computing Surveys (CSUR)
Distributed systems (2nd Ed.)
Experiences, Strategies, and Challenges in Building Fault-Tolerant CORBA Systems
IEEE Transactions on Computers
CCMPerf: A Benchmarking Tool for CORBA Component Model Implementations
Real-Time Systems
MEAD: support for Real-Time Fault-Tolerant CORBA: Research Articles
Concurrency and Computation: Practice & Experience - Foundations of Middleware Technologies
Fault-tolerance for Stateful Application Servers in the Presence of Advanced Transactions Patterns
SRDS '05 Proceedings of the 24th IEEE Symposium on Reliable Distributed Systems
End-to-end latency of a fault-tolerant CORBA infrastructure
Performance Evaluation
Fault-tolerant middleware and the magical 1%
Proceedings of the ACM/IFIP/USENIX 2005 International Conference on Middleware
Architecting and implementing versatile dependability
Architecting Dependable Systems III
A study of unpredictability in fault-tolerant middleware
Computer Networks: The International Journal of Computer and Telecommunications Networking
Unpredictability in COTS-based systems often manifests as occasional instances of uncontrollably high response times. A particular category of COTS systems, fault-tolerant (FT) middleware, is used in critical enterprise and embedded applications where predictability is of paramount importance. Our prior empirical study, which used a client-server microbenchmark, suggested that hard bounds on the maximum latency are difficult to establish a priori, but that the unpredictability may be confined to less than 1% of the requests. In this paper, we present empirical data from 7 different three-tier FT-middleware applications that provide strong evidence for this "magical 1%" hypothesis. We conducted a controlled experiment with 7 teams of students from a graduate-level course at Carnegie Mellon University. Each team, starting from a common three-tier architecture, independently implemented and evaluated an original application using middleware (either CORBA or EJB) and a custom-implemented fault-tolerance mechanism (relying on either state-machine or primary-backup replication) for the middle-tier server. This experiment shows that unpredictability may not be avoidable, even in the absence of faults, and that, in some cases, the random latency outliers are larger than the time needed to recover from a fault. The data also reveal a statistically significant result: across all 7 applications, unpredictability is confined to the highest 1% of the recorded end-to-end latencies and is not correlated with the request rate, the size of the messages exchanged, or the number of clients. This suggests that strict predictability is hard to achieve in FT-middleware systems and that developers of critical FT applications should focus on guaranteeing bounds for statistical measures, such as the 99th percentile of the latency.
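To make the closing recommendation concrete, the following is a minimal Python sketch of how one might compare the 99th-percentile latency with the maximum latency for a set of recorded end-to-end measurements. It is an illustration only, not the analysis used in the paper: the function names (percentile, summarize), the nearest-rank percentile definition, the microsecond unit, and the synthetic sample data are all assumptions introduced here.

```python
def percentile(samples, p):
    """Nearest-rank percentile for an integer p in 1..100.

    Returns the smallest recorded value such that at least p percent
    of the samples are less than or equal to it.
    """
    if not samples:
        raise ValueError("samples must be non-empty")
    if not isinstance(p, int) or not 1 <= p <= 100:
        raise ValueError("p must be an integer between 1 and 100")
    ordered = sorted(samples)
    # Integer ceiling of p * n / 100, avoiding floating-point rounding.
    rank = (p * len(ordered) + 99) // 100
    return ordered[rank - 1]


def summarize(latencies_us):
    """Summarize latencies (assumed to be in microseconds)."""
    p99 = percentile(latencies_us, 99)
    above = sum(1 for x in latencies_us if x > p99)
    return {
        "mean": sum(latencies_us) / len(latencies_us),
        "p99": p99,
        "max": max(latencies_us),
        # Fraction of requests slower than the 99th percentile.
        "tail_fraction": above / len(latencies_us),
    }


# Hypothetical data: 990 well-behaved requests plus 10 large outliers.
latencies = [1000 + i % 50 for i in range(990)]
latencies += [50000 + 1000 * i for i in range(10)]
summary = summarize(latencies)
```

In this synthetic data set the maximum is dominated by a handful of outliers, while the 99th percentile stays close to the typical latency. That gap is the reason a bound on a statistical measure can be more practical to guarantee than a bound on the worst case.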