Providing Non-stop Service for Message-Passing Based Parallel Applications with RADIC

  • Authors:
  • Guna Santos; Angelo Duarte; Dolores Rexachs; Emilio Luque

  • Affiliations:
  • Computer Architecture and Operating Systems Department, University Autonoma of Barcelona, Bellaterra, Spain 08193 (all authors)

  • Venue:
  • Euro-Par '08 Proceedings of the 14th International Euro-Par Conference on Parallel Processing
  • Year:
  • 2008

Abstract

Current supercomputers are approaching the petaflop level. These machines experience a high number of interruptions in a relatively short time interval. Fault tolerance and preventive maintenance are key to increasing the MTTI (Mean Time To Interrupt). In this paper we present how RADIC, an architecture for fault tolerance, provides different protection levels that avoid system interruptions and allow preventive maintenance tasks to be performed. Our experiments show the effectiveness of our solution in maintaining high availability with a large MTTI.