Leases: an efficient fault-tolerant mechanism for distributed file cache consistency
SOSP '89 Proceedings of the twelfth ACM symposium on Operating systems principles
Unreliable failure detectors for reliable distributed systems
Journal of the ACM (JACM)
ACM Transactions on Computer Systems (TOCS)
Time, clocks, and the ordering of events in a distributed system
Communications of the ACM
Internet indirection infrastructure
Proceedings of the 2002 conference on Applications, technologies, architectures, and protocols for computer communications
Pinpoint: Problem Determination in Large, Dynamic Internet Services
DSN '02 Proceedings of the 2002 International Conference on Dependable Systems and Networks
The Eigentrust algorithm for reputation management in P2P networks
WWW '03 Proceedings of the 12th international conference on World Wide Web
Performance debugging for distributed systems of black boxes
SOSP '03 Proceedings of the nineteenth ACM symposium on Operating systems principles
Implementing declarative overlays
Proceedings of the twentieth ACM symposium on Operating systems principles
Automated, scalable debugging of MPI programs with Intel® Message Checker
Proceedings of the second international workshop on Software engineering for high performance computing system applications
Problem diagnosis in large-scale computing environments
Proceedings of the 2006 ACM/IEEE conference on Supercomputing
Using queries for distributed monitoring and forensics
Proceedings of the 1st ACM SIGOPS/EuroSys European Conference on Computer Systems 2006
Webstudio: building infrastructure for web data management
Proceedings of the 2007 ACM SIGMOD international conference on Management of data
Model checking large network protocol implementations
NSDI'04 Proceedings of the 1st conference on Symposium on Networked Systems Design and Implementation - Volume 1
Boxwood: abstractions as the foundation for storage infrastructure
OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
MapReduce: simplified data processing on large clusters
OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
Using magpie for request extraction and workload modelling
OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
Using model checking to find serious file system errors
OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
Flight data recorder: monitoring persistent-state interactions to improve systems management
OSDI '06 Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation - Volume 7
Bigtable: a distributed storage system for structured data
OSDI '06 Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation - Volume 7
The Chubby lock service for loosely-coupled distributed systems
OSDI '06 Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation - Volume 7
Replay debugging for distributed applications
ATEC '06 Proceedings of the annual conference on USENIX '06 Annual Technical Conference
Pip: detecting the unexpected in distributed systems
NSDI'06 Proceedings of the 3rd conference on Networked Systems Design & Implementation - Volume 3
Detours: binary interception of Win32 functions
WINSYM'99 Proceedings of the 3rd conference on USENIX Windows NT Symposium - Volume 3
Dryad: distributed data-parallel programs from sequential building blocks
Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007
Paxos made live: an engineering perspective
Proceedings of the twenty-sixth annual ACM symposium on Principles of distributed computing
Life, death, and the critical transition: finding liveness bugs in systems code
NSDI'07 Proceedings of the 4th USENIX conference on Networked systems design & implementation
WiDS checker: combating bugs in distributed systems
NSDI'07 Proceedings of the 4th USENIX conference on Networked systems design & implementation
X-trace: a pervasive network tracing framework
NSDI'07 Proceedings of the 4th USENIX conference on Networked systems design & implementation
Friday: global comprehension for distributed replay
NSDI'07 Proceedings of the 4th USENIX conference on Networked systems design & implementation
Hang analysis: fighting responsiveness bugs
Proceedings of the 3rd ACM SIGOPS/EuroSys European Conference on Computer Systems 2008
Declarative Network Verification
PADL '09 Proceedings of the 11th International Symposium on Practical Aspects of Declarative Languages
Live Debugging of Distributed Systems
CC '09 Proceedings of the 18th International Conference on Compiler Construction: Held as Part of the Joint European Conferences on Theory and Practice of Software, ETAPS 2009
MODIST: transparent model checking of unmodified distributed systems
NSDI'09 Proceedings of the 6th USENIX symposium on Networked systems design and implementation
CrystalBall: predicting and preventing inconsistencies in deployed distributed systems
NSDI'09 Proceedings of the 6th USENIX symposium on Networked systems design and implementation
Predicting and preventing inconsistencies in deployed distributed systems
ACM Transactions on Computer Systems (TOCS)
Efficient querying and maintenance of network provenance at internet-scale
Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Queuing dynamics and single-link stability of delay-based window congestion control
Computer Networks: The International Journal of Computer and Telecommunications Networking
A query language for understanding component interactions in production systems
Proceedings of the 24th ACM International Conference on Supercomputing
Measurement and diagnosis of address misconfigured P2P traffic
INFOCOM'10 Proceedings of the 29th conference on Information communications
Towards automatic inference of task hierarchies in complex systems
HotDep'08 Proceedings of the Fourth conference on Hot topics in system dependability
R2: an application-level kernel for record and replay
OSDI'08 Proceedings of the 8th USENIX conference on Operating systems design and implementation
Efficient diagnostic tracing for wireless sensor networks
Proceedings of the 8th ACM Conference on Embedded Networked Sensor Systems
Language-based replay via data flow cut
Proceedings of the eighteenth ACM SIGSOFT international symposium on Foundations of software engineering
Towards automatically checking thousands of failures with micro-specifications
HotDep'10 Proceedings of the Sixth international conference on Hot topics in system dependability
FATE and DESTINI: a framework for cloud recovery testing
Proceedings of the 8th USENIX conference on Networked systems design and implementation
G2: a graph processing system for diagnosing distributed systems
USENIXATC'11 Proceedings of the 2011 USENIX conference on USENIX annual technical conference
Debugging the data plane with anteater
Proceedings of the ACM SIGCOMM 2011 conference
Mining temporal invariants from partially ordered logs
SLAML '11 Managing Large-scale Systems via the Analysis of System Logs and the Application of Machine Learning Techniques
SOSP '11 Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles
Mining temporal invariants from partially ordered logs
ACM SIGOPS Operating Systems Review
Capturing performance assumptions using stochastic performance logic
ICPE '12 Proceedings of the 3rd ACM/SPEC International Conference on Performance Engineering
Structured comparative analysis of systems logs to diagnose performance problems
NSDI'12 Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation
Diagnostic tracing for wireless sensor networks
ACM Transactions on Sensor Networks (TOSN)
Failure recovery: when the cure is worse than the disease
HotOS'13 Proceedings of the 14th USENIX conference on Hot Topics in Operating Systems
Carat: collaborative energy diagnosis for mobile devices
Proceedings of the 11th ACM Conference on Embedded Networked Sensor Systems
On fault resilience of OpenStack
Proceedings of the 4th annual Symposium on Cloud Computing
Performance troubleshooting in data centers: an annotated bibliography?
ACM SIGOPS Operating Systems Review
Global property violation detection and diagnosis for wireless sensor networks
Proceedings of the 2013 International Conference on Compilers, Architectures and Synthesis for Embedded Systems
NetCheck: network diagnoses from blackbox traces
NSDI'14 Proceedings of the 11th USENIX Conference on Networked Systems Design and Implementation
Hi-index | 0.00 |
Testing large-scale distributed systems is a challenge, because some errors manifest themselves only after a distributed sequence of events that involves machine and network failures. D3S is a checker that allows developers to specify predicates on distributed properties of a deployed system, and that checks these predicates while the system is running. When D3S finds a problem it produces the sequence of state changes that led to the problem, allowing developers to quickly find the root cause. Developers write predicates in a simple and sequential programming style, while D3S checks these predicates in a distributed and parallel manner to allow checking to be scalable to large systems and fault tolerant. By using binary instrumentation, D3S works transparently with legacy systems and can change predicates to be checked at runtime. An evaluation with 5 deployed systems shows that D3S can detect non-trivial correctness and performance bugs at runtime and with low performance overhead (less than 8%).