Co-designing the failure analysis and monitoring of large-scale systems

Authors:
Abhishek Chandra;Rohini Prinja;Sourabh Jain;ZhiLi Zhang
Affiliations:
University of Minnesota;University of Minnesota;University of Minnesota;University of Minnesota
Venue:
ACM SIGMETRICS Performance Evaluation Review
Year:
2008

Abstract

Large-scale distributed systems provide the backbone for numerous distributed applications and online services. These systems span a multitude of computing nodes at different geographical locations, connected via wide-area networks and overlays. A major concern with such systems is their susceptibility to failures, which lead to service downtime and hence high monetary and business costs. In this paper, we argue that to understand failures in such a system, we need to co-design the monitoring system with the failure analysis system. Existing monitoring systems are not designed specifically for failure analysis; we advocate a new way to design a monitoring system whose goal is to uncover the causes of failures. Similarly, to serve as an effective tool, the failure analysis techniques themselves need to go beyond simple statistical analysis of failure events in isolation. Toward this end, we discuss some guiding principles for the co-design of monitoring and failure analysis systems for planetary-scale systems.