Automated packet trace analysis of TCP implementations
SIGCOMM '97 Proceedings of the ACM SIGCOMM '97 conference on Applications, technologies, architectures, and protocols for computer communication
BLT: Bi-layer tracing of HTTP and TCP&slash;IP
Proceedings of the 9th international World Wide Web conference on Computer networks : the international journal of computer and telecommunications netowrking
An open graph visualization system and its applications to software engineering
Software—Practice & Experience - Special issue on discrete algorithm engineering
A non-instrusive, wavelet-based approach to detecting network performance problems
IMW '01 Proceedings of the 1st ACM SIGCOMM Workshop on Internet Measurement
DPM: A Measurement System for Distributed Programs
IEEE Transactions on Computers
Learning Stochastic Regular Grammars by Means of a State Merging Method
ICGI '94 Proceedings of the Second International Colloquium on Grammatical Inference and Applications
Automatic Generation of a Software Performance Model Using an Object-Oriented Prototype
MASCOTS '95 Proceedings of the 3rd International Workshop on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems
Pinpoint: Problem Determination in Large, Dynamic Internet Services
DSN '02 Proceedings of the 2002 International Conference on Dependable Systems and Networks
Gprof: A call graph execution profiler
SIGPLAN '82 Proceedings of the 1982 SIGPLAN symposium on Compiler construction
The NetLogger Methodology for High Performance Distributed Systems Performance Analysis
HPDC '98 Proceedings of the 7th IEEE International Symposium on High Performance Distributed Computing
ICDCS '99 Proceedings of the 19th IEEE International Conference on Distributed Computing Systems
Using runtime paths for macroanalysis
HOTOS'03 Proceedings of the 9th conference on Hot Topics in Operating Systems - Volume 9
SSYM'00 Proceedings of the 9th conference on USENIX Security Symposium - Volume 9
User-level internet path diagnosis
SOSP '03 Proceedings of the nineteenth ACM symposium on Operating systems principles
TraceBack: first fault diagnosis by reconstruction of distributed control flow
Proceedings of the 2005 ACM SIGPLAN conference on Programming language design and implementation
IMPuLSE: integrated monitoring and profiling for large-scale environments
LCR '04 Proceedings of the 7th workshop on Workshop on languages, compilers, and run-time support for scalable systems
Deconstructing Commodity Storage Clusters
Proceedings of the 32nd annual international symposium on Computer Architecture
Failure detection and localization in component based systems by online tracking
Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining
Capturing, indexing, clustering, and retrieving system history
Proceedings of the twentieth ACM symposium on Operating systems principles
Daphne: performance debugging using model-driven anomaly characterization
Proceedings of the twentieth ACM symposium on Operating systems principles
Web services navigator: visualizing the execution of web services
IBM Systems Journal
Autonomous recovery in componentized Internet applications
Cluster Computing
Proceedings of the 11th workshop on ACM SIGOPS European workshop
Request extraction in Magpie: events, schemas and temporal joins
Proceedings of the 11th workshop on ACM SIGOPS European workshop
WAP5: black-box performance debugging for wide-area systems
Proceedings of the 15th international conference on World Wide Web
Automated Online Monitoring of Distributed Applications through External Monitors
IEEE Transactions on Dependable and Secure Computing
Challenges in managing dependable data systems
ACM SIGMETRICS Performance Evaluation Review - Design, implementation, and performance of storage systems
Stardust: tracking activity in a distributed storage system
SIGMETRICS '06/Performance '06 Proceedings of the joint international conference on Measurement and modeling of computer systems
Architecture-driven platform independent deterministic replay for distributed hard real-time systems
Proceedings of the ISSTA 2006 workshop on Role of software architecture for testing and analysis
Execution patterns for visualizing web services
SoftVis '06 Proceedings of the 2006 ACM symposium on Software visualization
VLDB '06 Proceedings of the 32nd international conference on Very large data bases
Towards self-predicting systems: What if you could ask ‘what-if’?
The Knowledge Engineering Review
Modeling and Tracking of Transaction Flow Dynamics for Fault Detection in Complex Systems
IEEE Transactions on Dependable and Secure Computing
Automated known problem diagnosis with event traces
Proceedings of the 1st ACM SIGOPS/EuroSys European Conference on Computer Systems 2006
Using queries for distributed monitoring and forensics
Proceedings of the 1st ACM SIGOPS/EuroSys European Conference on Computer Systems 2006
TestEJB: response time measurement and call dependency tracing for EJBs
MAI '07 Proceedings of the 1st workshop on Middleware-application interaction: in conjunction with Euro-Sys 2007
Performance problem localization in self-healing, service-oriented systems using Bayesian networks
Proceedings of the 2007 ACM symposium on Applied computing
Improved error reporting for software that uses black-box components
Proceedings of the 2007 ACM SIGPLAN conference on Programming language design and implementation
Mace: language support for building distributed systems
Proceedings of the 2007 ACM SIGPLAN conference on Programming language design and implementation
HOTDEP'06 Proceedings of the 2nd conference on Hot Topics in System Dependability - Volume 2
I/O system performance debugging using model-driven anomaly characterization
FAST'05 Proceedings of the 4th conference on USENIX Conference on File and Storage Technologies - Volume 4
HOTOS'05 Proceedings of the 10th conference on Hot Topics in Operating Systems - Volume 10
HOTOS'05 Proceedings of the 10th conference on Hot Topics in Operating Systems - Volume 10
Path-based faliure and evolution management
NSDI'04 Proceedings of the 1st conference on Symposium on Networked Systems Design and Implementation - Volume 1
Configuration debugging as search: finding the needle in the haystack
OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
Automatic misconfiguration troubleshooting with peerpressure
OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
Using magpie for request extraction and workload modelling
OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
Detecting performance anomalies in global applications
WORLDS'05 Proceedings of the 2nd conference on Real, Large Distributed Systems - Volume 2
Whodunit: transactional profiling for multi-tier applications
Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007
Towards highly reliable enterprise network services via inference of multi-level dependencies
Proceedings of the 2007 conference on Applications, technologies, architectures, and protocols for computer communications
Observer: keeping system models from becoming obsolete
HotAC II Hot Topics in Autonomic Computing on Hot Topics in Autonomic Computing
AjaxScope: a platform for remotely monitoring the client-side behavior of web 2.0 applications
Proceedings of twenty-first ACM SIGOPS symposium on Operating systems principles
Triage: diagnosing production run failures at the user's site
Proceedings of twenty-first ACM SIGOPS symposium on Operating systems principles
/*icomment: bugs or bad comments?*/
Proceedings of twenty-first ACM SIGOPS symposium on Operating systems principles
AutoBash: improving configuration management with operating system causality analysis
Proceedings of twenty-first ACM SIGOPS symposium on Operating systems principles
Operating system profiling via latency analysis
OSDI '06 Proceedings of the 7th symposium on Operating systems design and implementation
Automated Rule-Based Diagnosis through a Distributed Monitor System
IEEE Transactions on Dependable and Secure Computing
Architecture-driven diagnosis of performance failures in a token ring
HotDep'07 Proceedings of the 3rd workshop on on Hot Topics in System Dependability
eMIVA: tool support for the instrumentation of critical distributed applications
ACM SIGMETRICS Performance Evaluation Review
Hardware counter driven on-the-fly request signatures
Proceedings of the 13th international conference on Architectural support for programming languages and operating systems
Understanding and visualizing full systems with data flow tomography
Proceedings of the 13th international conference on Architectural support for programming languages and operating systems
CAMP: a common API for measuring performance
LISA'07 Proceedings of the 21st conference on Large Installation System Administration Conference
Hang analysis: fighting responsiveness bugs
Proceedings of the 3rd ACM SIGOPS/EuroSys European Conference on Computer Systems 2008
BorderPatrol: isolating events for black-box tracing
Proceedings of the 3rd ACM SIGOPS/EuroSys European Conference on Computer Systems 2008
Live monitoring: using adaptive instrumentation and analysis to debug and maintain web applications
HOTOS'07 Proceedings of the 11th USENIX workshop on Hot topics in operating systems
Why did my pc suddenly slow down?
SYSML'07 Proceedings of the 2nd USENIX workshop on Tackling computer systems problems with machine learning techniques
SPIKE: best practice generation for storage area networks
SYSML'07 Proceedings of the 2nd USENIX workshop on Tackling computer systems problems with machine learning techniques
Fingerpointing correlated failures in replicated systems
SYSML'07 Proceedings of the 2nd USENIX workshop on Tackling computer systems problems with machine learning techniques
DMTracker: finding bugs in large-scale parallel programs by detecting anomaly in data movements
Proceedings of the 2007 ACM/IEEE conference on Supercomputing
Scalable problem localization for distributed systems: principles and practices
Proceedings of the 2nd international conference on Scalable information systems
Tracking in a spaghetti bowl: monitoring transactions using footprints
SIGMETRICS '08 Proceedings of the 2008 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
DieCast: testing distributed systems with an accurate scale model
NSDI'08 Proceedings of the 5th USENIX Symposium on Networked Systems Design and Implementation
D3S: debugging deployed distributed systems
NSDI'08 Proceedings of the 5th USENIX Symposium on Networked Systems Design and Implementation
Remote profiling of resource constraints of web servers using mini-flash crowds
ATC'08 USENIX 2008 Annual Technical Conference on Annual Technical Conference
A dollar from 15 cents: cross-platform management for internet services
ATC'08 USENIX 2008 Annual Technical Conference on Annual Technical Conference
Using causality to diagnose configuration bugs
ATC'08 USENIX 2008 Annual Technical Conference on Annual Technical Conference
Towards realistic file-system benchmarks with CodeMRI
ACM SIGMETRICS Performance Evaluation Review
Dustminer: troubleshooting interactive complexity bugs in sensor networks
Proceedings of the 6th ACM conference on Embedded network sensor systems
A model of process documentation to determine provenance in mash-ups
ACM Transactions on Internet Technology (TOIT)
IBM Journal of Research and Development
Active Diagnosis of High-Level Faults in Distributed Internet Services
APNOMS '08 Proceedings of the 11th Asia-Pacific Symposium on Network Operations and Management: Challenges for Next Generation Network Operations and Service Management
The Study of Multi-agent Network Flow Architecture for Application Performance Evaluation
KES-AMSTA '07 Proceedings of the 1st KES International Symposium on Agent and Multi-Agent Systems: Technologies and Applications
Dynamic dependencies and performance improvement
LISA'08 Proceedings of the 22nd conference on Large installation system administration conference
Diagnosing distributed systems with self-propelled instrumentation
Proceedings of the 9th ACM/IFIP/USENIX International Conference on Middleware
Causeway: support for controlling and analyzing the execution of multi-tier applications
Proceedings of the ACM/IFIP/USENIX 2005 International Conference on Middleware
Isolation points: Creating performance-robust enterprise systems
ACM Transactions on Autonomous and Adaptive Systems (TAAS)
mBrace: action-based performance monitoring of multi-tier web applications
Proceedings of the Third Workshop on Dependable Distributed Data Management
Understanding customer problem troubleshooting from storage system logs
FAST '09 Proccedings of the 7th conference on File and storage technologies
DIADS: addressing the "my-problem-or-yours" syndrome with integrated SAN and database diagnosis
FAST '09 Proccedings of the 7th conference on File and storage technologies
Configuration-space performance anomaly depiction
LADIS '08 Proceedings of the 2nd Workshop on Large-Scale Distributed Systems and Middleware
Ranking the importance of alerts for problem determination in large computer systems
ICAC '09 Proceedings of the 6th international conference on Autonomic computing
AdaptGuard: guarding adaptive systems from instability
ICAC '09 Proceedings of the 6th international conference on Autonomic computing
End-to-end performance forecasting: finding bottlenecks before they happen
Proceedings of the 36th annual international symposium on Computer architecture
RPC chains: efficient client-server communication in geodistributed systems
NSDI'09 Proceedings of the 6th USENIX symposium on Networked systems design and implementation
Finding Symbolic Bug Patterns in Sensor Networks
DCOSS '09 Proceedings of the 5th IEEE International Conference on Distributed Computing in Sensor Systems
Troubleshooting thousands of jobs on production grids using data mining techniques
GRID '08 Proceedings of the 2008 9th IEEE/ACM International Conference on Grid Computing
Automated anomaly detection and performance modeling of enterprise applications
ACM Transactions on Computer Systems (TOCS)
How to keep your head above water while detecting errors
Proceedings of the 10th ACM/IFIP/USENIX International Conference on Middleware
Stateful error detection in high throughput applications
Proceedings of the 10th ACM/IFIP/USENIX International Conference on Middleware
Macroscope: end-point approach to networked application dependency discovery
Proceedings of the 5th international conference on Emerging networking experiments and technologies
EbAT: online methods for detecting utility cloud anomalies
Proceedings of the 6th Middleware Doctoral Symposium
Managing massive time series streams with multi-scale compressed trickles
Proceedings of the VLDB Endowment
Performance debugging in data centers: doing more with less
COMSNETS'09 Proceedings of the First international conference on COMmunication Systems And NETworks
Ganesha: blackBox diagnosis of MapReduce systems
ACM SIGMETRICS Performance Evaluation Review
Do you know your IQ?: a research agenda for information quality in systems
ACM SIGMETRICS Performance Evaluation Review
SherLog: error diagnosis by connecting clues from run-time logs
Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems
Run-time dependency tracking in data centers
Proceedings of the Third Annual ACM Bangalore Conference
SNTS: sensor network troubleshooting suite
DCOSS'07 Proceedings of the 3rd IEEE international conference on Distributed computing in sensor systems
Passive inspection of sensor networks
DCOSS'07 Proceedings of the 3rd IEEE international conference on Distributed computing in sensor systems
Towards versatile performance models for complex, popular applications
ACM SIGMETRICS Performance Evaluation Review
Are clouds ready for large distributed applications?
ACM SIGOPS Operating Systems Review
CloudXplor: a tool for configuration planning in clouds based on empirical data
Proceedings of the 2010 ACM Symposium on Applied Computing
Assessing operational impact in enterprise systems by mining usage patterns
DSOM'07 Proceedings of the Distributed systems: operations and management 18th IFIP/IEEE international conference on Managing virtualization of networks and services
PeerWatch: a fault detection and diagnosis tool for virtualized consolidation systems
Proceedings of the 7th international conference on Autonomic computing
A query language for understanding component interactions in production systems
Proceedings of the 24th ACM International Conference on Supercomputing
Practical performance models for complex, popular applications
Proceedings of the ACM SIGMETRICS international conference on Measurement and modeling of computer systems
How to keep your head above water while detecting errors
Middleware'09 Proceedings of the ACM/IFIP/USENIX 10th international conference on Middleware
FlowRank: ranking NetFlow records
Proceedings of the 6th International Wireless Communications and Mobile Computing Conference
Measurement and diagnosis of address misconfigured P2P traffic
INFOCOM'10 Proceedings of the 29th conference on Information communications
Adaptive system anomaly prediction for large-scale hosting infrastructures
Proceedings of the 29th ACM SIGACT-SIGOPS symposium on Principles of distributed computing
AjaxScope: A Platform for Remotely Monitoring the Client-Side Behavior of Web 2.0 Applications
ACM Transactions on the Web (TWEB)
Cloudward bound: planning for beneficial migration of enterprise applications to the cloud
Proceedings of the ACM SIGCOMM 2010 conference
Black-box problem diagnosis in parallel file systems
FAST'10 Proceedings of the 8th USENIX conference on File and storage technologies
Mochi: visual log-analysis based tools for debugging hadoop
HotCloud'09 Proceedings of the 2009 conference on Hot topics in cloud computing
Towards automatic inference of task hierarchies in complex systems
HotDep'08 Proceedings of the Fourth conference on Hot topics in system dependability
Automating network application dependency discovery: experiences, limitations, and new solutions
OSDI'08 Proceedings of the 8th USENIX conference on Operating systems design and implementation
USENIX'09 Proceedings of the 2009 conference on USENIX Annual technical conference
CLUEBOX: a performance log analyzer for automated troubleshooting
WASL'08 Proceedings of the First USENIX conference on Analysis of system logs
SALSA: analyzing logs as state machines
WASL'08 Proceedings of the First USENIX conference on Analysis of system logs
Look who's talking: discovering dependencies between virtual machines using CPU utilization
HotCloud'10 Proceedings of the 2nd USENIX conference on Hot topics in cloud computing
Experiences with tracing causality in networked services
INM/WREN'10 Proceedings of the 2010 internet network management conference on Research on enterprise networking
Diagnosing mobile applications in the wild
Hotnets-IX Proceedings of the 9th ACM SIGCOMM Workshop on Hot Topics in Networks
Performance analysis of idle programs
Proceedings of the ACM international conference on Object oriented programming systems languages and applications
Mining netflow records for critical network activities
AIMS'10 Proceedings of the Mechanisms for autonomous management of networks and services, and 4th international conference on Autonomous infrastructure, management and security
FlowChecker: Detecting Bugs in MPI Libraries via Message Flow Checking
Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
Behavior-based problem localization for parallel file systems
HotDep'10 Proceedings of the Sixth international conference on Hot topics in system dependability
Automating configuration troubleshooting with dynamic information flow analysis
OSDI'10 Proceedings of the 9th USENIX conference on Operating systems design and implementation
Visual debugging for stream processing applications
RV'10 Proceedings of the First international conference on Runtime verification
Improving software diagnosability via log enhancement
Proceedings of the sixteenth international conference on Architectural support for programming languages and operating systems
MT-WAVE: profiling multi-tier web applications
Proceedings of the 2nd ACM/SPEC International Conference on Performance engineering
DieCast: Testing Distributed Systems with an Accurate Scale Model
ACM Transactions on Computer Systems (TOCS)
QoSaaS: quality of service as a service
Hot-ICE'11 Proceedings of the 11th USENIX conference on Hot topics in management of internet, cloud, and enterprise networks and services
Diagnosing performance changes by comparing request flows
Proceedings of the 8th USENIX conference on Networked systems design and implementation
HotDep'06 Proceedings of the Second conference on Hot topics in system dependability
WiDS checker: combating bugs in distributed systems
NSDI'07 Proceedings of the 4th USENIX conference on Networked systems design & implementation
X-trace: a pervasive network tracing framework
NSDI'07 Proceedings of the 4th USENIX conference on Networked systems design & implementation
Friday: global comprehension for distributed replay
NSDI'07 Proceedings of the 4th USENIX conference on Networked systems design & implementation
ASDF: an automated, online framework for diagnosing performance problems
Architecting dependable systems VII
LIME: a framework for debugging load imbalance in multi-threaded execution
Proceedings of the 33rd International Conference on Software Engineering
Rake: semantics assisted network-based tracing framework
Proceedings of the Nineteenth International Workshop on Quality of Service
A flexible architecture integrating monitoring and analytics for managing large-scale data centers
Proceedings of the 8th ACM international conference on Autonomic computing
G2: a graph processing system for diagnosing distributed systems
USENIXATC'11 Proceedings of the 2011 USENIX conference on USENIX annual technical conference
OFRewind: enabling record and replay troubleshooting for networks
USENIXATC'11 Proceedings of the 2011 USENIX conference on USENIX annual technical conference
BotTrack: tracking botnets using NetFlow and PageRank
NETWORKING'11 Proceedings of the 10th international IFIP TC 6 conference on Networking - Volume Part I
PAL: Propagation-aware Anomaly Localization for cloud hosted distributed applications
SLAML '11 Managing Large-scale Systems via the Analysis of System Logs and the Application of Machine Learning Techniques
BLR-D: applying bilinear logistic regression to factored diagnosis problems
SLAML '11 Managing Large-scale Systems via the Analysis of System Logs and the Application of Machine Learning Techniques
PipeCloud: using causality to overcome speed-of-light delays in cloud-based disaster recovery
Proceedings of the 2nd ACM Symposium on Cloud Computing
Using link gradients to predict the impact of network latency on multitier applications
IEEE/ACM Transactions on Networking (TON)
Bootstrapping energy debugging on smartphones: a first look at energy bugs in mobile devices
Proceedings of the 10th ACM Workshop on Hot Topics in Networks
Fast extraction of adaptive change point based patterns for problem resolution in enterprise systems
DSOM'06 Proceedings of the 17th IFIP/IEEE international conference on Distributed Systems: operations and management
BLR-D: applying bilinear logistic regression to factored diagnosis problems
ACM SIGOPS Operating Systems Review
Improving Software Diagnosability via Log Enhancement
ACM Transactions on Computer Systems (TOCS) - Special Issue APLOS 2011
Dataflow Tomography: Information Flow Tracking For Understanding and Visualizing Full Systems
ACM Transactions on Architecture and Code Optimization (TACO)
Modeling the parallel execution of black-box services
HotCloud'11 Proceedings of the 3rd USENIX conference on Hot topics in cloud computing
Causeway: support for controlling and analyzing the execution of multi-tier applications
Middleware'05 Proceedings of the ACM/IFIP/USENIX 6th international conference on Middleware
Modellus: Automated modeling of complex internet data center applications
ACM Transactions on the Web (TWEB)
DAPA: diagnosing application performance anomalies for virtualized infrastructures
Hot-ICE'12 Proceedings of the 2nd USENIX conference on Hot Topics in Management of Internet, Cloud, and Enterprise Networks and Services
Structured comparative analysis of systems logs to diagnose performance problems
NSDI'12 Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation
Understanding and detecting real-world performance bugs
Proceedings of the 33rd ACM SIGPLAN conference on Programming Language Design and Implementation
Automatically finding performance problems with feedback-directed learning software testing
Proceedings of the 34th International Conference on Software Engineering
What is my program doing? program dynamics in programmer's terms
RV'11 Proceedings of the Second international conference on Runtime verification
NetPilot: automating datacenter network failure mitigation
Proceedings of the ACM SIGCOMM 2012 conference on Applications, technologies, architectures, and protocols for computer communication
Net-cohort: detecting and managing VM ensembles in virtualized data centers
Proceedings of the 9th international conference on Autonomic computing
PowerTracer: tracing requests in multi-tier services to diagnose energy inefficiency
Proceedings of the 9th international conference on Autonomic computing
NetPilot: automating datacenter network failure mitigation
ACM SIGCOMM Computer Communication Review - Special october issue SIGCOMM '12
Collaborative energy debugging for mobile devices
HotDep'12 Proceedings of the Eighth USENIX conference on Hot Topics in System Dependability
AppInsight: mobile app performance monitoring in the wild
OSDI'12 Proceedings of the 10th USENIX conference on Operating Systems Design and Implementation
Be conservative: enhancing failure diagnosis with proactive logging
OSDI'12 Proceedings of the 10th USENIX conference on Operating Systems Design and Implementation
X-ray: automating root-cause diagnosis of performance anomalies in production software
OSDI'12 Proceedings of the 10th USENIX conference on Operating Systems Design and Implementation
Online black-box failure prediction for mission critical distributed systems
SAFECOMP'12 Proceedings of the 31st international conference on Computer Safety, Reliability, and Security
A Recovery-Oriented Approach for Software Fault Diagnosis in Complex Critical Systems
International Journal of Adaptive, Resilient and Autonomic Systems
VScope: middleware for troubleshooting time-sensitive data center applications
Proceedings of the 13th International Middleware Conference
A study of unpredictability in fault-tolerant middleware
Computer Networks: The International Journal of Computer and Telecommunications Networking
Evaluating MapReduce for profiling application traffic
Proceedings of the first edition workshop on High performance and programmable networking
Proceedings of the 4th ACM/SPEC International Conference on Performance Engineering
Analytical modeling for what-if analysis in complex cloud computing applications
ACM SIGMETRICS Performance Evaluation Review
Adaptive monitoring of web-based applications: a performance study
Proceedings of the 28th Annual ACM Symposium on Applied Computing
Juggling the Jigsaw: towards automated problem inference from network trouble tickets
nsdi'13 Proceedings of the 10th USENIX conference on Networked Systems Design and Implementation
A characteristic study on failures of production distributed data-parallel programs
Proceedings of the 2013 International Conference on Software Engineering
An online service-oriented performance profiling tool for cloud computing systems
Frontiers of Computer Science: Selected Publications from Chinese Universities
Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles
ACM SIGOPS 24th Symposium on Operating Systems Principles
Carat: collaborative energy diagnosis for mobile devices
Proceedings of the 11th ACM Conference on Embedded Networked Sensor Systems
Asynchronous intrusion recovery for interconnected web services
Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles
ROOT: replaying multithreaded traces with resource-oriented ordering
Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles
On fault resilience of OpenStack
Proceedings of the 4th annual Symposium on Cloud Computing
Troubleshooting interactive complexity bugs in wireless sensor networks using data mining techniques
ACM Transactions on Sensor Networks (TOSN)
Verifiable network function outsourcing: requirements, challenges, and roadmap
Proceedings of the 2013 workshop on Hot topics in middleboxes and network function virtualization
Comprehending performance from real-world execution traces: a device-driver case
Proceedings of the 19th international conference on Architectural support for programming languages and operating systems
Performance troubleshooting in data centers: an annotated bibliography?
ACM SIGOPS Operating Systems Review
Making problem diagnosiswork for large-scale, production storage systems
LISA'13 Proceedings of the 27th international conference on Large Installation System Administration
NetCheck: network diagnoses from blackbox traces
NSDI'14 Proceedings of the 11th USENIX Conference on Networked Systems Design and Implementation
Hi-index | 0.00 |
Many interesting large-scale systems are distributed systems of multiple communicating components. Such systems can be very hard to debug, especially when they exhibit poor performance. The problem becomes much harder when systems are composed of "black-box" components: software from many different (perhaps competing) vendors, usually without source code available. Typical solutions-provider employees are not always skilled or experienced enough to debug these systems efficiently. Our goal is to design tools that enable modestly-skilled programmers (and experts, too) to isolate performance bottlenecks in distributed systems composed of black-box nodes.We approach this problem by obtaining message-level traces of system activity, as passively as possible and without any knowledge of node internals or message semantics. We have developed two very different algorithms for inferring the dominant causal paths through a distributed system from these traces. One uses timing information from RPC messages to infer inter-call causality; the other uses signal-processing techniques. Our algorithms can ascribe delay to specific nodes on specific causal paths. Unlike previous approaches to similar problems, our approach requires no modifications to applications, middleware, or messages.