A study of unpredictability in fault-tolerant middleware

  • Authors:
  • Tudor Dumitraş;Priya Narasimhan

  • Affiliations:
  • Department of Electrical & Computer Engineering, Carnegie Mellon University, Pittsburgh, PA 15213, United States;Department of Electrical & Computer Engineering, Carnegie Mellon University, Pittsburgh, PA 15213, United States

  • Venue:
  • Computer Networks: The International Journal of Computer and Telecommunications Networking
  • Year:
  • 2013

Quantified Score

Hi-index 0.00

Visualization

Abstract

In enterprise applications relying on fault-tolerant middleware, it is a common engineering practice to establish service-level agreements (SLAs) based on the 95th or the 99th percentiles of the latency, to allow a margin for unexpected variability. However, the extent of this unpredictability has not been studied systematically. We present an extensive empirical study of unpredictability in 16 distributed systems, ranging from simple transport protocols to fault-tolerant, middleware-based enterprise applications, and we show that the inherent unpredictability in the systems examined arises from at most 1% of the remote invocations. In the normal, fault-free operating mode most remote invocations have a predictable end-to-end latency, but the maximum latency follows unpredictable trends and is comparable with the time needed to recover from a fault. The maximum latency is not influenced by the system's workload, cannot be regulated through configuration parameters and is not correlated with the system's resource consumption. The high-latency outliers (up to three orders of magnitude higher than the average latency) have multiple causes and may originate in any component of the system. However, after filtering out 1% of the invocations with the highest recorded response-times, the latency becomes bounded with high statistical confidence (p