Adaptive Fault Tolerance for Scalable Cluster Computing in Space

  • Authors:
  • Mark L. James; Andrew A. Shapiro; Paul L. Springer; Hans P. Zima

  • Affiliations:
  • Jet Propulsion Laboratory, California Institute of Technology, Pasadena, CA 91109, USA (all authors)

  • Venue:
  • International Journal of High Performance Computing Applications
  • Year:
  • 2009

Abstract

Future deep-space exploration missions face the challenge of building more capable autonomous spacecraft and planetary rovers. Given the communication latencies and bandwidth limitations of such missions, increased autonomy becomes mandatory, along with enhanced on-board computational capabilities for operation in deep space or in time-critical situations. This will result in dramatic changes in the way missions are conducted and supported by on-board computing systems. Specifically, the traditional approach of relying exclusively on radiation-hardened hardware and modular redundancy will not be able to deliver the required computational power. As a consequence, such systems are expected to include high-capability, low-power components based on emerging commercial-off-the-shelf (COTS) multi-core technology. In this paper we describe the design of a generic framework for introspection that supports runtime monitoring and analysis of program execution as well as feedback-oriented recovery from faults. Our focus is on providing flexible software fault tolerance matched to the requirements and properties of applications by exploiting knowledge that is contained in an application knowledge base, provided by users, or automatically derived from specifications. A prototype implementation is currently in progress at the Jet Propulsion Laboratory, California Institute of Technology, targeting a cluster of Cell Broadband Engines.