Time, clocks, and the ordering of events in a distributed system
Communications of the ACM
Pinpoint: Problem Determination in Large, Dynamic Internet Services
DSN '02 Proceedings of the 2002 International Conference on Dependable Systems and Networks
Using Model Checking to Analyze Network Vulnerabilities
SP '00 Proceedings of the 2000 IEEE Symposium on Security and Privacy
Performance debugging for distributed systems of black boxes
SOSP '03 Proceedings of the nineteenth ACM symposium on Operating systems principles
Multi-resolution Abnormal Trace Detection Using Varied-length N-grams and Automata
ICAC '05 Proceedings of the Second International Conference on Automatic Computing
Proceedings of the 2005 conference on Applications, technologies, architectures, and protocols for computer communications
Quantifying Skype user satisfaction
Proceedings of the 2006 conference on Applications, technologies, architectures, and protocols for computer communications
Model checking large network protocol implementations
NSDI'04 Proceedings of the 1st conference on Symposium on Networked Systems Design and Implementation - Volume 1
Detecting BGP configuration faults with static analysis
NSDI'05 Proceedings of the 2nd conference on Symposium on Networked Systems Design & Implementation - Volume 2
OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
Using magpie for request extraction and workload modelling
OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
Pip: detecting the unexpected in distributed systems
NSDI'06 Proceedings of the 3rd conference on Networked Systems Design & Implementation - Volume 3
Automated Rule-Based Diagnosis through a Distributed Monitor System
IEEE Transactions on Dependable and Secure Computing
Rule-based static analysis of network protocol implementations
Information and Computation
BorderPatrol: isolating events for black-box tracing
Proceedings of the 3rd ACM SIGOPS/EuroSys European Conference on Computer Systems 2008
D3S: debugging deployed distributed systems
NSDI'08 Proceedings of the 5th USENIX Symposium on Networked Systems Design and Implementation
Live Debugging of Distributed Systems
CC '09 Proceedings of the 18th International Conference on Compiler Construction: Held as Part of the Joint European Conferences on Theory and Practice of Software, ETAPS 2009
CrystalBall: predicting and preventing inconsistencies in deployed distributed systems
NSDI'09 Proceedings of the 6th USENIX symposium on Networked systems design and implementation
NetPrints: diagnosing home network misconfigurations using shared knowledge
NSDI'09 Proceedings of the 6th USENIX symposium on Networked systems design and implementation
Detailed diagnosis in enterprise networks
Proceedings of the ACM SIGCOMM 2009 conference on Data communication
Detecting large-scale system problems by mining console logs
Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles
SherLog: error diagnosis by connecting clues from run-time logs
Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems
Mining invariants from console logs for system problem detection
USENIXATC'10 Proceedings of the 2010 USENIX conference on USENIX annual technical conference
Diagnosing performance changes by comparing request flows
Proceedings of the 8th USENIX conference on Networked systems design and implementation
X-trace: a pervasive network tracing framework
NSDI'07 Proceedings of the 4th USENIX conference on Networked systems design & implementation
Friday: global comprehension for distributed replay
NSDI'07 Proceedings of the 4th USENIX conference on Networked systems design & implementation
Leveraging existing instrumentation to automatically infer invariant-constrained models
Proceedings of the 19th ACM SIGSOFT symposium and the 13th European conference on Foundations of software engineering
Hi-index | 0.00 |
This paper introduces NetCheck, a tool designed to diagnose network problems in large and complex applications. NetCheck relies on blackbox tracing mechanisms, such as strace, to automatically collect sequences of network system call invocations generated by the application hosts. NetCheck performs its diagnosis by (1) totally ordering the distributed set of input traces, and by (2) utilizing a network model to identify points in the totally ordered execution where the traces deviated from expected network semantics. Our evaluation demonstrates that NetCheck is able to diagnose failures in popular and complex applications without relying on any application-or network-specific information. For instance, NetCheck correctly identified the existence of NAT devices, simultaneous network disconnection/ reconnection, and platform portability issues. In a more targeted evaluation, NetCheck correctly detects over 95% of the network problems we found from bug trackers of projects like Python, Apache, and Ruby. When applied to traces of faults reproduced in a live network, NetCheck identified the primary cause of the fault in 90% of the cases. Additionally, NetCheck is efficient and can process a GB-long trace in about 2 minutes.