Subtle implementation errors or misconfigurations in complex Internet services may lead to performance degradations without causing failures. These undiscovered performance anomalies afflict many of today’s systems, causing violations of service-level agreements (SLAs), unnecessary resource over-provisioning, or both. In this paper, we re-inserted realistic anomaly causes into a multi-tier Internet service architecture and studied their manifestations. We observed that each cause had certain workload and management parameters that were more likely to trigger manifestations, hinting that such parameters could be effective classifiers. This observation held even when anomaly causes manifested differently in combination than in isolation. Our study motivates EntomoModel, a framework for depicting performance anomaly manifestations. EntomoModel uses decision tree classification and a design-driven performance model to characterize the workload and management policy settings under which manifestations are likely. EntomoModel enables online system management that avoids anomaly manifestations by dynamically adjusting system management parameters. Our trace-driven evaluations show that manifestation avoidance based on EntomoModel, or entomophobic management, can reduce 98th percentile SLA violations by 67% compared to an anomaly-oblivious adaptive approach. In a cloud computing scenario with elastic resource allocation, our approach uses less than half of the resources needed by static over-provisioning.
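To make the classification idea concrete, the following is a minimal sketch of a decision tree that learns which workload and management settings are associated with anomaly manifestations, and then filters candidate management settings to those predicted to be safe. It is not the authors' implementation: the feature names (`requests_per_sec`, `cache_mb`), the observations, and all function names are hypothetical, the tree is a plain Gini-based classifier written from scratch, and the design-driven performance model mentioned in the abstract is omitted entirely.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Return (feature_index, threshold) with the lowest weighted Gini, or None."""
    best, best_score = None, None
    for f in range(len(rows[0])):
        values = sorted({r[f] for r in rows})
        # Candidate thresholds are midpoints between distinct observed values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best_score is None or score < best_score:
                best, best_score = (f, t), score
    return best

def build_tree(rows, labels, depth=0, max_depth=3):
    """Build a tree of nested (feature, threshold, left, right) tuples."""
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or depth == max_depth:
        return majority
    split = best_split(rows, labels)
    if split is None:  # no feature has two distinct values
        return majority
    f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return (f, t,
            build_tree([r for r, _ in left], [y for _, y in left], depth + 1, max_depth),
            build_tree([r for r, _ in right], [y for _, y in right], depth + 1, max_depth))

def predict(tree, row):
    """Walk the tree until a leaf (a bare label) is reached."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree

def safe_settings(tree, requests_per_sec, candidate_cache_mb):
    """Keep the management settings for which no manifestation is predicted."""
    return [c for c in candidate_cache_mb if not predict(tree, (requests_per_sec, c))]

# Hypothetical observations: (requests_per_sec, cache_mb) -> anomaly manifested?
rows = [(100, 64), (100, 256), (200, 64), (200, 256),
        (400, 64), (400, 256), (500, 64), (500, 256)]
labels = [False, False, False, False, True, False, True, False]

tree = build_tree(rows, labels)
print(predict(tree, (450, 64)))            # high load, small cache
print(safe_settings(tree, 450, [64, 256]))  # settings predicted to avoid it
```

In this toy data set the anomaly only manifests under high load combined with a small cache, so the tree needs two levels to separate the cases. An online manager in the spirit of the abstract would call something like `safe_settings` as the workload changes and pick a setting from the returned list.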