We present a new approach to managing failures and evolution in large, complex distributed systems using runtime paths. We use the paths that requests follow as they move through the system as our core abstraction, and our "macro" approach focuses on component interactions rather than the details of the components themselves. Paths record component performance and interactions, are user- and request-centric, and occur in sufficient volume to enable statistical analysis, all in a way that is easily reusable across applications. Automated statistical analysis of multiple paths allows for the detection and diagnosis of complex failures and the assessment of evolution issues. In particular, our approach enables significantly stronger capabilities in failure detection, failure diagnosis, impact analysis, and understanding system evolution. We explore these capabilities with three real implementations, two of which service millions of requests per day. Our contributions include the approach; the maintainable, extensible, and reusable architecture; the various statistical analysis engines; and the discussion of our experience with a high-volume production service over several years.
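To make the idea of statistical analysis over runtime paths concrete, the following is a minimal sketch and not the system described above. It assumes a path is simply the ordered list of components a request touched, with a per-hop latency and a single pass/fail flag for the whole request. The names `Hop`, `Path`, `failure_rate_by_component`, and `rank_suspects`, as well as the component names and numbers in the sample data, are hypothetical and chosen only for illustration. The analysis shown is a simple per-component failure rate, standing in for the richer statistical engines the abstract mentions.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class Hop:
    component: str     # hypothetical component name
    latency_ms: float  # time the request spent in this component


@dataclass
class Path:
    request_id: str
    hops: List[Hop] = field(default_factory=list)
    failed: bool = False  # assumed: one pass/fail outcome per request

    def components(self) -> Set[str]:
        """Distinct components this request passed through."""
        return {hop.component for hop in self.hops}


def failure_rate_by_component(paths: List[Path]) -> Dict[str, float]:
    """Fraction of paths through each component that ended in failure."""
    seen: Dict[str, int] = defaultdict(int)
    failed: Dict[str, int] = defaultdict(int)
    for path in paths:
        for name in path.components():
            seen[name] += 1
            if path.failed:
                failed[name] += 1
    return {name: failed[name] / seen[name] for name in seen}


def rank_suspects(paths: List[Path]) -> List[Tuple[str, float]]:
    """Components ordered by failure rate (highest first), ties by name."""
    rates = failure_rate_by_component(paths)
    return sorted(rates.items(), key=lambda kv: (-kv[1], kv[0]))


# Made-up sample data: four requests, two of which failed.
paths = [
    Path("r1", [Hop("frontend", 2.0), Hop("cart", 5.0), Hop("db", 9.0)]),
    Path("r2", [Hop("frontend", 2.1), Hop("search", 4.0)]),
    Path("r3", [Hop("frontend", 1.9), Hop("cart", 5.5), Hop("inventory", 30.0)], failed=True),
    Path("r4", [Hop("frontend", 2.2), Hop("inventory", 28.0)], failed=True),
]

for name, rate in rank_suspects(paths):
    print(f"{name}: {rate:.2f}")
```

In this toy data every path through `inventory` fails while `frontend` appears on both failed and successful paths, so the ranking points at `inventory` first. Because the analysis looks only at which components a request touched and how it ended, it needs no knowledge of component internals, which is the sense in which such an approach operates at the "macro" level.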