Fault-tolerant middleware and the magical 1%
Proceedings of the ACM/IFIP/USENIX 2005 International Conference on Middleware
In enterprise applications that rely on fault-tolerant middleware, it is common engineering practice to base service-level agreements (SLAs) on the 95th or 99th percentile of latency, to allow a margin for unexpected variability. However, the extent of this unpredictability has not been studied systematically. We present an extensive empirical study of unpredictability in 16 distributed systems, ranging from simple transport protocols to fault-tolerant, middleware-based enterprise applications, and we show that the inherent unpredictability in the systems examined arises from at most 1% of the remote invocations. In the normal, fault-free operating mode, most remote invocations have a predictable end-to-end latency, but the maximum latency follows unpredictable trends and is comparable with the time needed to recover from a fault. The maximum latency is not influenced by the system's workload, cannot be regulated through configuration parameters, and is not correlated with the system's resource consumption. The high-latency outliers (up to three orders of magnitude higher than the average latency) have multiple causes and may originate in any component of the system. However, after filtering out the 1% of invocations with the highest recorded response times, the latency becomes bounded with high statistical confidence (p
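The filtering step described in the abstract can be illustrated with a minimal sketch. It is not the study's methodology or its statistical test: the function names `trim_top_fraction` and `percentile`, the nearest-rank percentile definition, and the synthetic latency values are all assumptions chosen for illustration.

```python
import math

def trim_top_fraction(latencies, fraction=0.01):
    """Return the samples left after dropping the highest `fraction` of them."""
    ordered = sorted(latencies)
    n_drop = round(len(ordered) * fraction)  # number of outliers to discard
    return ordered[:len(ordered) - n_drop]

def percentile(latencies, q):
    """Nearest-rank percentile for an integer q in 1..100 (assumed definition)."""
    ordered = sorted(latencies)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]

# Synthetic latencies in milliseconds (hypothetical data, not from the study):
# 990 invocations close to 1 ms and 10 outliers close to 1 s.
samples = [1.0 + 0.001 * (i % 50) for i in range(990)]
samples += [1000.0 + i for i in range(10)]

kept = trim_top_fraction(samples, 0.01)

print("max before trimming:", max(samples))
print("max after trimming: ", max(kept))
print("99th percentile:    ", percentile(samples, 99))
```

On this synthetic data the untrimmed maximum is about three orders of magnitude above the typical latency, while both the trimmed maximum and the 99th percentile stay close to the typical value, which is the kind of contrast a percentile-based SLA relies on.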