SEDA: an architecture for well-conditioned, scalable internet services
SOSP '01 Proceedings of the eighteenth ACM symposium on Operating systems principles
ACM Transactions on Computer Systems (TOCS)
MapReduce: simplified data processing on large clusters
OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
Automatic misconfiguration troubleshooting with peerpressure
OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
An analysis of latent sector errors in disk drives
Proceedings of the 2007 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Disk failures in the real world: what does an MTTF of 1,000,000 hours mean to you?
FAST '07 Proceedings of the 5th USENIX conference on File and Storage Technologies
Argon: performance insulation for shared storage servers
FAST '07 Proceedings of the 5th USENIX conference on File and Storage Technologies
AjaxScope: a platform for remotely monitoring the client-side behavior of web 2.0 applications
Proceedings of twenty-first ACM SIGOPS symposium on Operating systems principles
DARC: dynamic analysis of root causes of latency distributions
SIGMETRICS '08 Proceedings of the 2008 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
PARDA: proportional allocation of resources for distributed storage access
FAST '09 Proccedings of the 7th conference on File and storage technologies
DRAM errors in the wild: a large-scale field study
Proceedings of the eleventh international joint conference on Measurement and modeling of computer systems
Detecting large-scale system problems by mining console logs
Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles
Benchmarking cloud serving systems with YCSB
Proceedings of the 1st ACM symposium on Cloud computing
Characterizing cloud computing hardware reliability
Proceedings of the 1st ACM symposium on Cloud computing
Black-box problem diagnosis in parallel file systems
FAST'10 Proceedings of the 8th USENIX conference on File and storage technologies
Improving MapReduce performance in heterogeneous environments
OSDI'08 Proceedings of the 8th USENIX conference on Operating systems design and implementation
ICTCP: Incast Congestion Control for TCP in data center networks
Proceedings of the 6th International COnference
Availability in globally distributed storage systems
OSDI'10 Proceedings of the 9th USENIX conference on Operating systems design and implementation
Reining in the outliers in map-reduce clusters using Mantri
OSDI'10 Proceedings of the 9th USENIX conference on Operating systems design and implementation
Diagnosing performance changes by comparing request flows
Proceedings of the 8th USENIX conference on Networked systems design and implementation
Understanding network failures in data centers: measurement, analysis, and implications
Proceedings of the ACM SIGCOMM 2011 conference
YCSB++: benchmarking and performance debugging advanced features in scalable table stores
Proceedings of the 2nd ACM Symposium on Cloud Computing
Pesto: online storage performance management in virtualized datacenters
Proceedings of the 2nd ACM Symposium on Cloud Computing
Making time-stepped applications tick in the cloud
Proceedings of the 2nd ACM Symposium on Cloud Computing
Fast crash recovery in RAMCloud
SOSP '11 Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles
The Case for Evaluating MapReduce Performance Using Workload Suites
MASCOTS '11 Proceedings of the 2011 IEEE 19th Annual International Symposium on Modelling, Analysis, and Simulation of Computer and Telecommunication Systems
Understanding and detecting real-world performance bugs
Proceedings of the 33rd ACM SIGPLAN conference on Programming Language Design and Implementation
MegaPipe: a new programming interface for scalable network I/O
OSDI'12 Proceedings of the 10th USENIX conference on Operating Systems Design and Implementation
X-ray: automating root-cause diagnosis of performance anomalies in production software
OSDI'12 Proceedings of the 10th USENIX conference on Operating Systems Design and Implementation
Performance isolation and fairness for multi-tenant cloud storage
OSDI'12 Proceedings of the 10th USENIX conference on Operating Systems Design and Implementation
Cake: enabling high-level SLOs on shared storage systems
Proceedings of the Third ACM Symposium on Cloud Computing
CPI2: CPU performance isolation for shared compute clusters
Proceedings of the 8th ACM European Conference on Computer Systems
Effective straggler mitigation: attack of the clones
nsdi'13 Proceedings of the 10th USENIX conference on Networked Systems Design and Implementation
On fault resilience of OpenStack
Proceedings of the 4th annual Symposium on Cloud Computing
Hi-index | 0.00 |
We highlight one often-overlooked cause of performance failure: limpware -- "limping" hardware whose performance degrades significantly compared to its specification. We report anecdotes of degraded disks and network components seen in large-scale production. To measure the system-level impact of limpware, we assembled limpbench, a set of benchmarks that combine data-intensive load and limpware injections. We benchmark five cloud systems (Hadoop, HDFS, ZooKeeper, Cassandra, and HBase) and find that limpware can severely impact distributed operations, nodes, and an entire cluster. From this, we introduce the concept of limplock, a situation where a system progresses slowly due to the presence of limpware and is not capable of failing over to healthy components. We show how each cloud system that we analyze can exhibit operation, node, and cluster limplock. We conclude that many cloud systems are not limpware tolerant.