Creating a transparent, distributed, and resilient computing environment: the OpenRTE project

Authors:
Ralph H. Castain;Jeffrey M. Squyres
Affiliations:
Los Alamos National Laboratory, Los Alamos, USA NM-87544;Cisco Systems, Inc., San Jose, USA CA-95134
Venue:
The Journal of Supercomputing
Year:
2007

Abstract

Meeting the future computing needs of the scientific community will likely require the development of petascale computing environments based on the integration of significant numbers of processors into large-scale clusters, and the (possibly heterogeneous) aggregation of multiple clusters for use by individual and/or synchronized applications. Despite best efforts, applications running on such complex systems must expect to encounter failures of their computing resources and/or networks during the course of execution.

The Open Run-Time Environment (OpenRTE) has been designed to support high-performance computing applications in such environments. Gaining acceptance by the user community requires that OpenRTE not only meet basic functional requirements, but also provide users with (a) a transparent interface that avoids the need to customize applications when moving between specific computing and/or communication resources; (b) effective strategies that can be selected at run-time for dealing with faults; (c) transparent support for inter-process communication, resource discovery and allocation, and process launch across a variety of platforms; and (d) the ability to launch their applications remotely from their desktop, disconnect from them, and reconnect at a later time to monitor progress.

This paper provides an updated description of OpenRTE and discusses its relation to the current grid protocols. In addition, we introduce the concept of resilient computing, a next-generation approach to fault tolerance, and describe how OpenRTE will utilize this concept in the future.