Robust design for distributed computing systems

  • Authors:
  • Jon B. Weissman;Darin Allen England

  • Affiliations:
  • University of Minnesota;University of Minnesota

  • Venue:
  • Robust design for distributed computing systems
  • Year:
  • 2006

Abstract

Robust systems have the ability to maintain performance under a wide variety of operating conditions. Although the notion of robustness is very intuitive and is generally considered desirable, there exist no widely used quantitative metrics for this property.

In the first part of this work we define robustness in terms of its impact on performance and present a new technique for measuring and characterizing the robustness of a system to a specific disturbance. Unlike previous work, our approach does not require the use of sophisticated mathematical models. To show its efficacy, the metric is applied to three different scheduling problems.

In the second part of this work we apply the methodology of dynamic programming to effect robust policies for making resource management decisions in the face of uncertainty. We apply this methodology in a novel way to new problems that are posed by the emergence of on-demand computing. We view the problem from the perspective of a software service provider whose objective is to minimize the cost of leasing resources and maintain an adequate quality of service. By using this methodology, service providers can make good leasing decisions in the face of such uncertainties as random demand for the service and random execution times of service requests. The resulting policies reduce the cost of hosting a service and significantly reduce its variance, an indication of robustness.

In the third part of this work we develop and evaluate a new robust network topology for applications that operate on a spanning tree overlay network. Unlike previous work that is adaptive or reactive in nature, we take a proactive approach: the topology itself is able to simultaneously withstand disturbances and exhibit good performance. We present both centralized and distributed tree construction algorithms and evaluate their effectiveness through analysis and simulation of two classes of distributed applications: data collection in sensor networks, and data dissemination in divisible load scheduling. The results show that our robust spanning trees achieve a desirable trade-off for opposing performance metrics where more commonly used forms of spanning trees do not.
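
The dynamic-programming idea in the second part can be illustrated with a small sketch. This is not the authors' formulation: the cost model (a per-period lease cost, a setup cost for each newly added server, a penalty per unit of unmet demand, and independent, identically distributed demand) and every name below, including `plan_leases`, are assumptions chosen only to show how backward induction yields a leasing policy under random demand.

```python
def plan_leases(demand_pmf, horizon, max_servers, lease_cost, setup_cost, penalty):
    """Finite-horizon stochastic DP for a hypothetical leasing problem.

    State: number of servers currently leased. Action: number to lease next.
    Returns (value, policy), where value[s] is the minimum expected total cost
    starting with s servers and policy[t][s] is the best action in period t.
    """
    levels = range(max_servers + 1)
    # Expected unmet demand when k servers are leased (assumed QoS proxy).
    shortfall = [sum(p * max(d - k, 0) for d, p in demand_pmf.items())
                 for k in levels]

    value = [0.0] * (max_servers + 1)   # cost-to-go after the last period
    policy = []
    for _ in range(horizon):            # backward induction over periods
        new_value, decisions = [], []
        for s in levels:
            candidates = [
                (setup_cost * max(k - s, 0)      # pay to add servers
                 + lease_cost * k                # pay to hold k servers
                 + penalty * shortfall[k]        # expected QoS penalty
                 + value[k],                     # expected future cost
                 k)
                for k in levels
            ]
            best_cost, best_k = min(candidates)
            new_value.append(best_cost)
            decisions.append(best_k)
        value = new_value
        policy.insert(0, decisions)     # keep policy ordered by period
    return value, policy


if __name__ == "__main__":
    pmf = {0: 0.5, 2: 0.5}              # illustrative demand distribution
    value, policy = plan_leases(pmf, horizon=2, max_servers=2,
                                lease_cost=1.0, setup_cost=1.0, penalty=10.0)
    print("expected cost from 0 servers:", value[0])
    print("first-period action from 0 servers:", policy[0][0])
```

In this toy model the shortfall penalty stands in for quality of service; random execution times, which the abstract also mentions, are not modeled.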