An autonomic operating environment for large-scale distributed applications

  • Authors:
  • Tobin J. Lehman; Robert G. Deen; James H. Kaufman

  • Affiliations:
  • IBM Research Division, Almaden Research Center, 650 Harry Rd, San Jose, CA 95120, USA. Tel.: +1 408 927 1781/ Fax: +1 408 927 2100/ E-mail: toby@almaden.ibm.com

  • Venue:
  • Integrated Computer-Aided Engineering - Autonomous Computing
  • Year:
  • 2006

Abstract

This paper presents the OptimalGrid Operating Environment and shows how it tackles the difficulty of automatically building, distributing and running connected distributed parallel programs. One of the main goals of OptimalGrid is to hide the complexity of working with grid applications, both the creation of the grid application itself and its preparation on the grid infrastructure. The OptimalGrid Operating Environment hides much of the complexity of building, deploying and running applications by using autonomic techniques such as goal-oriented operations, alternative workflow schedules and agent-peer communication. Unlike conventional script-based systems, OptimalGrid uses the abstraction of target goals, which gives it more flexibility in handling errors and unexpected system events. A target goal can be achieved in multiple ways, and the selection of a solution can be based on several factors, including previous success, peer information and user assistance. The main environment components, the Grid Director and the Grid Manager, interoperate in a high-level workflow environment, so target goals can be retried with alternative solutions or achieved with human interaction. The use of high-level goals, with multiple solutions that are determined by past success, peer cooperation or human interaction, coupled with a flexible retry mechanism, results in a novel approach for distributed operating environments.
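
The goal-oriented retry idea described in the abstract can be illustrated with a minimal sketch. This is not OptimalGrid's actual implementation or API: the `Goal` and `Solution` classes, the 0.5 prior for untried solutions, and the `ask_user` callback are all hypothetical names and choices introduced here for illustration, and selection by peer information is left out for brevity. The sketch only shows the general pattern of a goal with several alternative solutions, ordered by previous success, with a fall-back to human assistance when every solution fails.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Solution:
    """One hypothetical way of achieving a goal (illustrative only)."""
    name: str
    action: Callable[[], bool]  # returns True on success
    successes: int = 0
    attempts: int = 0

    def success_rate(self) -> float:
        # Assumption: untried solutions get a neutral prior of 0.5.
        return self.successes / self.attempts if self.attempts else 0.5


@dataclass
class Goal:
    """A target goal that can be reached through alternative solutions."""
    name: str
    solutions: List[Solution] = field(default_factory=list)

    def achieve(
        self, ask_user: Optional[Callable[[str], bool]] = None
    ) -> Optional[str]:
        # Try solutions best-first by past success rate; sorted() is
        # stable, so ties keep their original order.
        ranked = sorted(
            self.solutions, key=lambda s: s.success_rate(), reverse=True
        )
        for solution in ranked:
            solution.attempts += 1
            try:
                ok = solution.action()
            except Exception:
                ok = False  # an error counts as a failed attempt
            if ok:
                solution.successes += 1
                return solution.name
        # Every automatic solution failed: fall back to human interaction.
        if ask_user is not None and ask_user(self.name):
            return "user"
        return None
```

In this sketch a script-style failure does not abort the run: the goal simply moves on to the next alternative, and the recorded success counts change the order in which solutions are tried the next time the same goal is requested.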