We propose a new approach for developing and deploying distributed systems in which nodes predict the distributed consequences of their actions and use this information to detect and avoid errors. Each node continuously runs a state exploration algorithm on a recent consistent snapshot of its neighborhood and predicts possible future violations of specified safety properties. We describe a new state exploration algorithm, consequence prediction, which explores causally related chains of events that lead to property violations. This paper describes the design and implementation of this approach, termed CrystalBall. We evaluate CrystalBall on RandTree, BulletPrime, Paxos, and Chord distributed system implementations, and identify new bugs in mature Mace implementations of three systems. Furthermore, we show that if a bug is not corrected during system development, CrystalBall is effective in steering the execution away from inconsistent states at runtime.
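The idea of exploring forward from a snapshot to predict safety violations can be illustrated with a small sketch. This is not the consequence prediction algorithm itself, which is not detailed here. It is a plain bounded breadth-first search over a toy two-node system, and every name in it (`predict_violation`, `toy_enabled`, `toy_apply`, `toy_safe`, the event names) is a hypothetical chosen for illustration.

```python
from collections import deque
from typing import Callable, Hashable, Iterable, List, Optional

State = Hashable
Event = str


def predict_violation(
    snapshot: State,
    enabled_events: Callable[[State], Iterable[Event]],
    apply_event: Callable[[State, Event], State],
    is_safe: Callable[[State], bool],
    max_depth: int,
) -> Optional[List[Event]]:
    """Bounded breadth-first exploration starting from a snapshot.

    Returns the shortest event sequence (at most max_depth events) that
    reaches a state violating is_safe, or None if none is found.
    An empty list means the snapshot itself is already unsafe.
    """
    if not is_safe(snapshot):
        return []
    seen = {snapshot}
    frontier = deque([(snapshot, [])])
    while frontier:
        state, path = frontier.popleft()
        if len(path) == max_depth:
            continue  # depth bound reached; do not expand further
        for event in enabled_events(state):
            nxt = apply_event(state, event)
            if nxt in seen:
                continue  # already explored this global state
            seen.add(nxt)
            new_path = path + [event]
            if not is_safe(nxt):
                return new_path
            frontier.append((nxt, new_path))
    return None


# Toy system (an assumption, not from the paper): the state is
# (a_is_leader, b_is_leader). The timeout handlers are deliberately buggy:
# a node promotes itself without checking whether the other node leads.
def toy_enabled(state):
    a, b = state
    events = []
    if not a:
        events.append("a_timeout")
    if not b:
        events.append("b_timeout")
    if a:
        events.append("a_resign")
    if b:
        events.append("b_resign")
    return events


def toy_apply(state, event):
    a, b = state
    if event == "a_timeout":
        return (True, b)
    if event == "b_timeout":
        return (a, True)
    if event == "a_resign":
        return (False, b)
    if event == "b_resign":
        return (a, False)
    raise ValueError(f"unknown event: {event}")


def toy_safe(state):
    a, b = state
    return not (a and b)  # safety property: at most one leader


if __name__ == "__main__":
    print(predict_violation((True, False), toy_enabled, toy_apply, toy_safe, 3))
```

In this toy setting, a node that receives a predicted path such as `["b_timeout"]` could decline to run the offending handler, which is one simple way to picture steering execution away from an inconsistent state. A real deployment would need consistent neighborhood snapshots and a far more selective search than this exhaustive one.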