Limplock: understanding the impact of limpware on scale-out cloud systems

Authors:
Thanh Do;Mingzhe Hao;Tanakorn Leesatapornwongsa;Tiratat Patana-anake;Haryadi S. Gunawi
Affiliations:
University of Wisconsin-Madison;University of Chicago;University of Chicago;University of Chicago;University of Chicago
Venue:
Proceedings of the 4th annual Symposium on Cloud Computing
Year:
2013

Citing 34
Cited 1

SEDA: an architecture for well-conditioned, scalable internet services

SOSP '01 Proceedings of the eighteenth ACM symposium on Operating systems principles
Run-time adaptation in river

ACM Transactions on Computer Systems (TOCS)
MapReduce: simplified data processing on large clusters

OSDI'04 Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation - Volume 6
Correlating instrumentation data to system states: a building block for automated diagnosis and control

OSDI'04 Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation - Volume 6
Automatic misconfiguration troubleshooting with peerpressure

OSDI'04 Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation - Volume 6
An analysis of latent sector errors in disk drives

Proceedings of the 2007 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Disk failures in the real world: what does an MTTF of 1,000,000 hours mean to you?

FAST '07 Proceedings of the 5th USENIX conference on File and Storage Technologies
Argon: performance insulation for shared storage servers

FAST '07 Proceedings of the 5th USENIX conference on File and Storage Technologies
AjaxScope: a platform for remotely monitoring the client-side behavior of web 2.0 applications

Proceedings of twenty-first ACM SIGOPS symposium on Operating systems principles
DARC: dynamic analysis of root causes of latency distributions

SIGMETRICS '08 Proceedings of the 2008 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
PARDA: proportional allocation of resources for distributed storage access

FAST '09 Proceedings of the 7th conference on File and storage technologies
DRAM errors in the wild: a large-scale field study

Proceedings of the eleventh international joint conference on Measurement and modeling of computer systems
Detecting large-scale system problems by mining console logs

Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles
Benchmarking cloud serving systems with YCSB

Proceedings of the 1st ACM symposium on Cloud computing
Characterizing cloud computing hardware reliability

Proceedings of the 1st ACM symposium on Cloud computing
Black-box problem diagnosis in parallel file systems

FAST'10 Proceedings of the 8th USENIX conference on File and storage technologies
Improving MapReduce performance in heterogeneous environments

OSDI'08 Proceedings of the 8th USENIX conference on Operating systems design and implementation
ICTCP: Incast Congestion Control for TCP in data center networks

Proceedings of the 6th International Conference
Availability in globally distributed storage systems

OSDI'10 Proceedings of the 9th USENIX conference on Operating systems design and implementation
Reining in the outliers in map-reduce clusters using Mantri

OSDI'10 Proceedings of the 9th USENIX conference on Operating systems design and implementation
Diagnosing performance changes by comparing request flows

Proceedings of the 8th USENIX conference on Networked systems design and implementation
Understanding network failures in data centers: measurement, analysis, and implications

Proceedings of the ACM SIGCOMM 2011 conference
YCSB++: benchmarking and performance debugging advanced features in scalable table stores

Proceedings of the 2nd ACM Symposium on Cloud Computing
Pesto: online storage performance management in virtualized datacenters

Proceedings of the 2nd ACM Symposium on Cloud Computing
Making time-stepped applications tick in the cloud

Proceedings of the 2nd ACM Symposium on Cloud Computing
Fast crash recovery in RAMCloud

SOSP '11 Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles
The Case for Evaluating MapReduce Performance Using Workload Suites

MASCOTS '11 Proceedings of the 2011 IEEE 19th Annual International Symposium on Modelling, Analysis, and Simulation of Computer and Telecommunication Systems
Understanding and detecting real-world performance bugs

Proceedings of the 33rd ACM SIGPLAN conference on Programming Language Design and Implementation
MegaPipe: a new programming interface for scalable network I/O

OSDI'12 Proceedings of the 10th USENIX conference on Operating Systems Design and Implementation
X-ray: automating root-cause diagnosis of performance anomalies in production software

OSDI'12 Proceedings of the 10th USENIX conference on Operating Systems Design and Implementation
Performance isolation and fairness for multi-tenant cloud storage

OSDI'12 Proceedings of the 10th USENIX conference on Operating Systems Design and Implementation
Cake: enabling high-level SLOs on shared storage systems

Proceedings of the Third ACM Symposium on Cloud Computing
CPI2: CPU performance isolation for shared compute clusters

Proceedings of the 8th ACM European Conference on Computer Systems
Effective straggler mitigation: attack of the clones

NSDI'13 Proceedings of the 10th USENIX conference on Networked Systems Design and Implementation

On fault resilience of OpenStack

Proceedings of the 4th annual Symposium on Cloud Computing

Abstract

We highlight one often-overlooked cause of performance failure: limpware -- "limping" hardware whose performance degrades significantly compared to its specification. We report anecdotes of degraded disks and network components seen in large-scale production. To measure the system-level impact of limpware, we assembled limpbench, a set of benchmarks that combine data-intensive load and limpware injections. We benchmark five cloud systems (Hadoop, HDFS, ZooKeeper, Cassandra, and HBase) and find that limpware can severely impact distributed operations, nodes, and an entire cluster. From this, we introduce the concept of limplock, a situation where a system progresses slowly due to the presence of limpware and is not capable of failing over to healthy components. We show how each cloud system that we analyze can exhibit operation, node, and cluster limplock. We conclude that many cloud systems are not limpware tolerant.
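
As a rough illustration of the operation-level limplock described above, the following Python sketch models a pipelined transfer whose throughput is bounded by its slowest node. It is not taken from the paper or from limpbench: the function names, the bandwidth figures, and the median-based failover policy are all hypothetical and chosen only to make the idea concrete.

```python
from statistics import median


def pipeline_throughput(bandwidths_mbps):
    """A pipelined operation can move data no faster than its slowest node."""
    return min(bandwidths_mbps)


def transfer_seconds(size_megabits, bandwidths_mbps):
    """Time to push size_megabits through the whole pipeline."""
    return size_megabits / pipeline_throughput(bandwidths_mbps)


def fail_over(bandwidths_mbps, spare_mbps, threshold=0.1):
    """Hypothetical policy: swap out any node slower than threshold * median.

    This stands in for the failover that a limplocked system does not perform.
    """
    cutoff = threshold * median(bandwidths_mbps)
    return [spare_mbps if b < cutoff else b for b in bandwidths_mbps]


# Illustrative numbers only (assumptions, not measurements from the paper).
healthy = [1000.0, 1000.0, 1000.0]   # three nodes at 1000 Mbps
limping = [1000.0, 1.0, 1000.0]      # one node degraded to 1 Mbps

size = 8000.0  # megabits to transfer

print(transfer_seconds(size, healthy))                      # 8.0
print(transfer_seconds(size, limping))                      # 8000.0
print(transfer_seconds(size, fail_over(limping, 1000.0)))   # 8.0
```

In this toy model the limping node is still alive, so a crash-oriented failure detector would not replace it, and the operation runs 1000 times slower; only a policy that treats severe slowdown as a failure restores the healthy transfer time.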