Distributed hash tables (DHTs) have been adopted as a building block for large-scale distributed systems. As mission-critical applications begin to be layered on them, their robust operation becomes even more important. Although DHTs can detect and heal around unresponsive hosts and disconnected links, several kinds of hidden faults and performance bottlenecks go undetected, resulting in unanswered queries and delayed responses. In this paper, we propose dFault, a system that helps large-scale DHTs localize such faults. Given a log of failed queries, called symptoms, and some available information about the hosts in the DHT, dFault identifies the potential root causes (hosts and overlay links) that, with high likelihood, contributed to those symptoms. Its design is based on the recently proposed dependency-graph modeling and inference approach to fault localization. We describe the design of dFault and show, using a real prototype deployed over PlanetLab, that it can accurately localize the root causes of faults with a modest amount of information collected from individual nodes.
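To make the idea of inferring root causes from symptoms concrete, the following is a minimal sketch and not dFault's actual inference algorithm, which the abstract does not specify. It assumes each logged query is annotated with the hosts and overlay links it depended on, and it scores each component by the fraction of queries through it that failed. The function name `rank_root_causes`, the input layout, and the scoring rule are all hypothetical choices made for illustration.

```python
from collections import defaultdict


def rank_root_causes(queries):
    """Rank hosts and overlay links by how strongly they correlate with failures.

    `queries` is a list of (components, failed) pairs, where `components` is an
    iterable of string identifiers for the hosts and overlay links a query
    depended on, and `failed` is True when the query was logged as a symptom.
    This input layout is an assumption made for this sketch.
    """
    seen = defaultdict(int)    # queries that depended on each component
    failed = defaultdict(int)  # failed queries that depended on each component

    for components, did_fail in queries:
        for component in set(components):
            seen[component] += 1
            if did_fail:
                failed[component] += 1

    # Hypothetical score: fraction of dependent queries that failed.
    scores = {c: failed[c] / seen[c] for c in seen}

    # Highest score first; ties broken by number of failures, then by name.
    return sorted(scores.items(), key=lambda kv: (-kv[1], -failed[kv[0]], kv[0]))


if __name__ == "__main__":
    # Toy log: hosts are "A".."D", overlay links are written "X->Y".
    log = [
        (["A", "A->B", "B"], True),
        (["A", "A->C", "C"], False),
        (["C", "C->B", "B"], True),
        (["A", "A->D", "D"], False),
    ]
    for component, score in rank_root_causes(log):
        print(component, round(score, 2))
```

A simple ratio like this cannot separate a faulty host from a faulty link when both appear only in the same failed queries. That ambiguity is the kind of problem a dependency-graph inference approach is meant to handle.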