For many years, people have advocated developing failure detection as a basic service, but so far without much success. We believe this is because important system engineering issues have not yet been adequately addressed, which has prevented the definition of a truly generic service. Ultimately, our goal is to define a service that is simple yet expressive and powerful enough to support the requirements of many distributed applications. To this end, we consider an alternative interaction model between the service and the applications, called accrual failure detectors. Roughly speaking, an accrual failure detector associates with each process a real value representing a suspicion level, instead of the traditional binary information (i.e., trust vs. suspect). In this paper, we provide a rigorous definition of accrual failure detectors, demonstrate that changing the interaction model leads to no loss in computational power, discuss quality-of-service issues, and present several possible implementations.
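The interaction model described above can be illustrated with a minimal Python sketch. It is not one of the implementations presented in the paper: the class name `SimpleAccrualDetector`, the helper `suspects`, and the choice of suspicion level (time since the last heartbeat, measured in units of the mean heartbeat interval) are all assumptions made for illustration. The point it shows is the separation of concerns: the detector only reports a real-valued suspicion level, and each application derives its own binary trust/suspect decision by applying its own threshold.

```python
from collections import deque


class SimpleAccrualDetector:
    """Toy accrual failure detector for one monitored process (hypothetical)."""

    def __init__(self, window=100):
        # Recent heartbeat inter-arrival times, in seconds.
        self.intervals = deque(maxlen=window)
        self.last_arrival = None

    def heartbeat(self, now):
        """Record a heartbeat received at time `now` (seconds)."""
        if self.last_arrival is not None:
            self.intervals.append(now - self.last_arrival)
        self.last_arrival = now

    def suspicion(self, now):
        """Return a non-negative suspicion level that grows while no heartbeat arrives.

        Assumed metric: elapsed silence divided by the mean observed interval.
        """
        if self.last_arrival is None or not self.intervals:
            return 0.0  # not enough information to suspect anything yet
        mean = sum(self.intervals) / len(self.intervals)
        if mean <= 0:
            return 0.0
        return max(0.0, (now - self.last_arrival) / mean)


def suspects(level, threshold):
    """Application-side binary interpretation of a suspicion level."""
    return level > threshold


detector = SimpleAccrualDetector()
for t in (0.0, 1.0, 2.0, 3.0):
    detector.heartbeat(t)

level = detector.suspicion(8.0)       # 5 mean intervals of silence
aggressive = suspects(level, 2.0)     # an application that reacts quickly
conservative = suspects(level, 8.0)   # an application that tolerates long pauses
```

With the same suspicion level, the two hypothetical applications reach different conclusions, which is the flexibility the binary model cannot offer directly.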