Adaptive Fault Tolerance for Scalable Cluster Computing in Space

Authors:
Mark L. James; Andrew A. Shapiro; Paul L. Springer; Hans P. Zima
Affiliations:
Jet Propulsion Laboratory, California Institute of Technology, Pasadena, CA 91109, USA (all authors)
Venue:
International Journal of High Performance Computing Applications
Year:
2009

Abstract

Future deep-space exploration missions face the challenge of building more capable autonomous spacecraft and planetary rovers. Given the communication latencies and bandwidth limitations of such missions, increased autonomy becomes mandatory, and with it the need for enhanced on-board computational capabilities in deep space and in time-critical situations. This will dramatically change the way missions are conducted and supported by on-board computing systems. Specifically, the traditional approach of relying exclusively on radiation-hardened hardware and modular redundancy will not be able to deliver the required computational power. As a consequence, such systems are expected to include high-capability, low-power components based on emerging commercial-off-the-shelf (COTS) multi-core technology. In this paper we describe the design of a generic framework for introspection that supports runtime monitoring and analysis of program execution as well as feedback-oriented recovery from faults. Our focus is on providing flexible software fault tolerance matched to the requirements and properties of applications by exploiting knowledge that is contained in an application knowledge base, provided by users, or automatically derived from specifications. A prototype implementation is currently in progress at the Jet Propulsion Laboratory, California Institute of Technology, targeting a cluster of Cell Broadband Engines.
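
The monitor, analyze, recover cycle mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' framework: the class names (`Event`, `Introspector`), the event kinds, the recovery action names, and the dictionary standing in for an application knowledge base are all hypothetical, chosen only to show how knowledge about an application might select a recovery action at runtime.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Event:
    """A monitoring observation reported for a task (hypothetical type)."""
    source: str  # name of the monitored task
    kind: str    # e.g. "assertion_failed", "heartbeat_missed"


# Hypothetical stand-in for an application knowledge base:
# maps an observed event kind to the name of a recovery action.
KNOWLEDGE_BASE: Dict[str, str] = {
    "assertion_failed": "rollback_to_checkpoint",
    "heartbeat_missed": "restart_task",
}


@dataclass
class Introspector:
    """Analyzes events and chooses a recovery action (illustrative only)."""
    knowledge_base: Dict[str, str]
    log: List[str] = field(default_factory=list)

    def analyze(self, event: Event) -> str:
        # Event kinds not covered by the knowledge base fall back to a
        # conservative default action (an assumption of this sketch).
        return self.knowledge_base.get(event.kind, "isolate_task")

    def recover(self, event: Event) -> str:
        # Feedback step: record the decision and return the chosen action.
        action = self.analyze(event)
        self.log.append(f"{event.source}:{event.kind}->{action}")
        return action


if __name__ == "__main__":
    introspector = Introspector(KNOWLEDGE_BASE)
    for ev in (Event("nav", "assertion_failed"), Event("imaging", "bit_flip")):
        print(ev.source, "->", introspector.recover(ev))
```

In this sketch, application-specific knowledge lives in data rather than in the monitoring code, so the same loop could serve applications with different fault-tolerance requirements by swapping the knowledge base.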