Large distributed systems contain many components that can interact in unforeseen and complicated ways; this emergent "vulnerability of complexity" increases the likelihood of cascading failures that can result in widespread disruption. Our research explores whether we can exploit knowledge of the system's topology, the application's interconnections, and the application's normal fault-free behavior to build proactive fault-tolerance techniques that curb the spread of cascading failures and enable faster system-wide recovery. We seek to characterize what this topology knowledge would entail, quantify the benefits of our approach, and understand the associated trade-offs.