Toward automatic policy refinement in repair services for large distributed systems

Authors:
Moises Goldszmidt;Mihai Budiu;Yue Zhang;Michael Pechuk
Affiliations:
Microsoft Research;Microsoft Research;Microsoft Windows Azure;Microsoft Windows Azure
Venue:
ACM SIGOPS Operating Systems Review
Year:
2010

Abstract

To be economically feasible and to offer high levels of availability and performance, large-scale distributed systems depend on the automation of repair services. While there has been considerable work on mechanisms for such automated services, a framework for evaluating and optimizing the policies governing such mechanisms has been lacking. In this paper we propose one such framework and report on our initial experience in applying the framework to analyze and optimize the operation of a geo-distributed cloud storage system at Microsoft.