A study of unpredictability in fault-tolerant middleware

Authors:
Tudor Dumitraş;Priya Narasimhan
Affiliations:
Department of Electrical & Computer Engineering, Carnegie Mellon University, Pittsburgh, PA 15213, United States;Department of Electrical & Computer Engineering, Carnegie Mellon University, Pittsburgh, PA 15213, United States
Venue:
Computer Networks: The International Journal of Computer and Telecommunications Networking
Year:
2013


Abstract

In enterprise applications relying on fault-tolerant middleware, it is common engineering practice to establish service-level agreements (SLAs) based on the 95th or 99th percentile of the latency, to allow a margin for unexpected variability. However, the extent of this unpredictability has not been studied systematically. We present an extensive empirical study of unpredictability in 16 distributed systems, ranging from simple transport protocols to fault-tolerant, middleware-based enterprise applications, and we show that the inherent unpredictability in the systems examined arises from at most 1% of the remote invocations. In the normal, fault-free operating mode, most remote invocations have a predictable end-to-end latency, but the maximum latency follows unpredictable trends and is comparable with the time needed to recover from a fault. The maximum latency is not influenced by the system's workload, cannot be regulated through configuration parameters, and is not correlated with the system's resource consumption. The high-latency outliers (up to three orders of magnitude higher than the average latency) have multiple causes and may originate in any component of the system. However, after filtering out the 1% of invocations with the highest recorded response times, the latency becomes bounded with high statistical confidence (p
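The percentile-based view of latency described in the abstract can be illustrated with a small sketch. It is not the authors' measurement code. The latency samples are synthetic, the function names (`percentile`, `trim_top`, `summarize`) are hypothetical, and the nearest-rank percentile definition is an assumption chosen for simplicity. The sketch only shows how discarding the highest 1% of samples changes the maximum while leaving the bulk of the distribution untouched.

```python
import math
import random
import statistics


def percentile(sorted_values, q):
    """Nearest-rank percentile (0 < q <= 100) of an ascending-sorted list."""
    rank = max(1, math.ceil(q * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def trim_top(latencies, percent=1):
    """Return the samples in ascending order without the highest `percent`%."""
    ordered = sorted(latencies)
    drop = math.ceil(percent * len(ordered) / 100)
    return ordered[:len(ordered) - drop]


def summarize(latencies):
    """Mean, 95th/99th percentile, and maximum of a list of latencies."""
    ordered = sorted(latencies)
    return {
        "mean": statistics.fmean(ordered),
        "p95": percentile(ordered, 95),
        "p99": percentile(ordered, 99),
        "max": ordered[-1],
    }


# Synthetic data (an assumption, not measurements from the study):
# 990 invocations near 1 ms plus 10 outliers that are 100x to 1000x slower.
rng = random.Random(0)
samples = [rng.uniform(0.9, 1.1) for _ in range(990)]
samples += [rng.uniform(100.0, 1000.0) for _ in range(10)]

before = summarize(samples)
trimmed = trim_top(samples, percent=1)
after = summarize(trimmed)

print("all samples:   ", before)
print("top 1% removed:", after)
```

In this synthetic setup the outliers make up exactly 1% of the samples, so the 99th percentile already falls on a well-behaved invocation while the maximum does not. Real traces would need a statistical test of boundedness, which this sketch does not attempt.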