Creating a transparent, distributed, and resilient computing environment: the OpenRTE project

  • Authors:
  • Ralph H. Castain;Jeffrey M. Squyres

  • Affiliations:
  • Los Alamos National Laboratory, Los Alamos, USA NM-87544;Cisco Systems, Inc., San Jose, USA CA-95134

  • Venue:
  • The Journal of Supercomputing
  • Year:
  • 2007

Quantified Score

Hi-index 0.00

Visualization

Abstract

Meeting the future computing needs of the scientific community will likely require the development of petascale computing environments based on the integration of significant numbers of processors into large-scale clusters, and the (possibly heterogeneous) aggregation of multiple clusters for use by individual and/or synchronized applications. Despite the best of efforts, such complex systems dictate that applications must expect to encounter failures of their computing resources and/or networks during the course of execution.The Open Run-Time Environment (OpenRTE) has been designed to support high-performance computing applications in such environments. Gaining acceptance by the user community requires that OpenRTE not only meet basic functional requirements, but must also provide users with (a) a transparent interface that avoids the need to customize applications when moving between specific computing and/or communication resources; (b) effective strategies that can be selected at run-time for dealing with faults; (c) transparent support for inter-process communication, resource discovery and allocation, and process launch across a variety of platforms; and (d) the ability to launch their applications remotely from their desktop, disconnect from them, and reconnect at a later time to monitor progress.This paper provides an updated description of OpenRTE and discusses its relation to the current grid protocols. In addition, we introduce the concept of resilient computing--a next-generation approach to fault tolerance--and describe how OpenRTE will utilize this concept in the future.