Finding latent performance bugs in systems implementations

Authors:
Charles Killian;Karthik Nagaraj;Salman Pervez;Ryan Braud;James W. Anderson;Ranjit Jhala
Affiliations:
Purdue University, West Lafayette, IN, USA;Purdue University, West Lafayette, IN, USA;Purdue University, West Lafayette, IN, USA;University of California, San Diego, La Jolla, CA, USA;University of California, San Diego, La Jolla, CA, USA;University of California, San Diego, La Jolla, CA, USA
Venue:
Proceedings of the eighteenth ACM SIGSOFT international symposium on Foundations of software engineering
Year:
2010

Abstract

Robust distributed systems commonly employ high-level recovery mechanisms that let the system recover from a wide variety of problematic environmental conditions, such as node failures, packet drops, and link disconnections. Unfortunately, these recovery mechanisms also mask serious design and implementation errors, disguising them as latent performance bugs that severely degrade end-to-end system performance. These bugs typically go unnoticed because it is hard to distinguish a bug from an intermittent environmental condition that the system must tolerate. We present techniques that automatically pinpoint latent performance bugs in systems implementations, in the spirit of recent advances in model checking by systematic state space exploration. The techniques automate three steps: conducting random simulations, identifying performance anomalies, and analyzing anomalous executions to pinpoint the circumstances leading to performance degradation. Our implementation, MACEPC, targets the MACE toolkit, so it can be used to test our implementations directly, without modification. We have applied MACEPC to five thoroughly tested and trusted distributed systems implementations. MACEPC found significant, previously unknown, long-standing performance bugs in each of the systems, and led to fixes that significantly improved their end-to-end performance.
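
The simulate, detect, analyze loop described in the abstract can be illustrated with a small toy sketch. This is not MACEPC or MACE code: the event names, probabilities, timings, the median-based threshold, and the functions `simulate`, `find_anomalies`, and `suspect_event` are all hypothetical, chosen only to show how a bug masked by a recovery mechanism might surface as a set of unusually slow but otherwise successful runs.

```python
import random
import statistics

# Hypothetical environmental events and their per-run probabilities.
EVENT_RATES = {"join_dropped": 0.10, "peer_restart": 0.30, "link_flap": 0.30}


def simulate(seed):
    """One random run of a toy system (a stand-in, not MACE code)."""
    rng = random.Random(seed)
    events = {name for name, p in EVENT_RATES.items() if rng.random() < p}
    time = 10 + rng.randint(0, 5)  # normal time to converge
    if "join_dropped" in events:
        # Injected latent bug: the lost join is only repaired by a slow
        # periodic recovery timer, so the run succeeds but takes far longer.
        time += 100
    return {"seed": seed, "time": time, "events": events}


def find_anomalies(runs, factor=3.0):
    """Flag runs whose convergence time exceeds factor times the median."""
    median = statistics.median(r["time"] for r in runs)
    return [r for r in runs if r["time"] > factor * median]


def suspect_event(runs, anomalies):
    """Return the event most over-represented in anomalous runs, or None."""
    slow = {r["seed"] for r in anomalies}
    normal = [r for r in runs if r["seed"] not in slow]
    if not anomalies or not normal:
        return None

    def rate(group, name):
        return sum(name in r["events"] for r in group) / len(group)

    return max(EVENT_RATES, key=lambda n: rate(anomalies, n) - rate(normal, n))


runs = [simulate(seed) for seed in range(200)]
anomalies = find_anomalies(runs)
print(len(anomalies), "slow runs; suspect:", suspect_event(runs, anomalies))
```

Every run in this toy converges, so a correctness check alone would pass. Only the comparison of completion times across many seeded runs separates the slow executions, and comparing event frequencies between slow and normal runs points at the triggering condition. Fixed seeds keep each anomalous run reproducible for closer inspection.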