A query language for understanding component interactions in production systems

Authors:
Adam J. Oliner;Alex Aiken
Affiliations:
Stanford University;Stanford University
Venue:
Proceedings of the 24th ACM International Conference on Supercomputing
Year:
2010

Citing 24
Cited 1

Understanding “why” in software process modelling, analysis, and design

ICSE '94 Proceedings of the 16th international conference on Software engineering
Pinpoint: Problem Determination in Large, Dynamic Internet Services

DSN '02 Proceedings of the 2002 International Conference on Dependable Systems and Networks
On the distributed fault diagnosis of computer networks

ISCC '95 Proceedings of the IEEE Symposium on Computers and Communications (ISCC'95)
Performance debugging for distributed systems of black boxes

SOSP '03 Proceedings of the nineteenth ACM symposium on Operating systems principles
Shrink: a tool for failure diagnosis in IP networks

Proceedings of the 2005 ACM SIGCOMM workshop on Mining network data
Detecting causal relationships in distributed computations: in search of the holy grail

Distributed Computing
Capturing, indexing, clustering, and retrieving system history

Proceedings of the twentieth ACM symposium on Operating systems principles
WAP5: black-box performance debugging for wide-area systems

Proceedings of the 15th international conference on World Wide Web
Stanley: The robot that won the DARPA Grand Challenge: Research Articles

Journal of Robotic Systems - Special Issue on the DARPA Grand Challenge, Part 2
Emergent (mis)behavior vs. complex software systems

Proceedings of the 1st ACM SIGOPS/EuroSys European Conference on Computer Systems 2006
Using queries for distributed monitoring and forensics

Proceedings of the 1st ACM SIGOPS/EuroSys European Conference on Computer Systems 2006
Path-based failure and evolution management

NSDI'04 Proceedings of the 1st conference on Symposium on Networked Systems Design and Implementation - Volume 1
IP fault localization via risk modeling

NSDI'05 Proceedings of the 2nd conference on Symposium on Networked Systems Design & Implementation - Volume 2
Using magpie for request extraction and workload modelling

OSDI'04 Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation - Volume 6
What Supercomputers Say: A Study of Five System Logs

DSN '07 Proceedings of the 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks
Replay debugging for distributed applications

ATEC '06 Proceedings of the annual conference on USENIX '06 Annual Technical Conference
Pip: detecting the unexpected in distributed systems

NSDI'06 Proceedings of the 3rd conference on Networked Systems Design & Implementation - Volume 3
Towards highly reliable enterprise network services via inference of multi-level dependencies

Proceedings of the 2007 conference on Applications, technologies, architectures, and protocols for computer communications
D3S: debugging deployed distributed systems

NSDI'08 Proceedings of the 5th USENIX Symposium on Networked Systems Design and Implementation
Junior: The Stanford entry in the Urban Challenge

Journal of Field Robotics - Special Issue on the 2007 DARPA Urban Challenge, Part II
Alert Detection in System Logs

ICDM '08 Proceedings of the 2008 Eighth IEEE International Conference on Data Mining
Debugging in the (very) large: ten years of implementation and experience

Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles
Detecting large-scale system problems by mining console logs

Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles
WiDS checker: combating bugs in distributed systems

NSDI'07 Proceedings of the 4th USENIX conference on Networked systems design & implementation

Secure network provenance

SOSP '11 Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles

Abstract

When something unexpected happens in a large production system, administrators must first perform a search to isolate which components and component interactions are likely to be involved. The system may consist of thousands of interacting subsystems, the logging instrumentation may be noisy or incomplete, and the problem description may be vague, so this search is often the most difficult part of understanding the system's behavior. To facilitate the search process, we present a query language and a method for computing these queries that makes minimal assumptions about the available data. We evaluate our method on nearly 1.22 billion lines of system logs from four supercomputers, two autonomous vehicles, and a server cluster.