SOSP '11 Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles
When something unexpected happens in a large production system, administrators must first search for the components and component interactions that are likely to be involved. The system may consist of thousands of interacting subsystems, the logging instrumentation may be noisy or incomplete, and the problem description may be vague, so this search is often the most difficult part of understanding the system's behavior. To facilitate the search, we present a query language and a method for computing its queries that makes minimal assumptions about the available data. We evaluate our method on nearly 1.22 billion lines of system logs from four supercomputers, two autonomous vehicles, and a server cluster.