Using dynamic task level redundancy for OpenMP fault tolerance

  • Authors:
  • Oussama Tahan;Mohamed Shawky

  • Affiliations:
  • Centre de Recherches de Royallieu, Heudiasyc-UMR 6599 Université de Technologie de Compiègne, Compiègne cedex, France;Centre de Recherches de Royallieu, Heudiasyc-UMR 6599 Université de Technologie de Compiègne, Compiègne cedex, France

  • Venue:
  • ARCS'12 Proceedings of the 25th international conference on Architecture of Computing Systems
  • Year:
  • 2012

Abstract

Building fault-tolerant applications and systems is one of today's most important research topics. Fault tolerance is becoming increasingly essential for shared-memory parallel programs and for multi-core and many-core architectures, because shrinking transistor sizes lead to a growing number of failures. Few techniques for fault-tolerant OpenMP programs have been studied so far. The existing work relies on checkpoint and recovery or on static thread-level redundancy. However, these approaches may exhibit scalability issues when the number of cores increases or when the workload is unbalanced. To overcome these issues, this paper presents a dynamic task-level redundancy technique for fault-tolerant OpenMP applications. The method dynamically applies Triple Modular Redundancy (TMR) to OpenMP tasks through a dedicated runtime and uses majority voting to guarantee correct results. This flexible approach has been evaluated for performance and fault coverage, and it showed small overhead with good error detection and recovery rates.
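
To make the replicate-and-vote idea in the abstract concrete, the following is a minimal sketch in C with OpenMP pragmas. It is not the authors' runtime: the paper applies redundancy dynamically inside a dedicated runtime, whereas this sketch replicates one task three times by hand at the source level. The names `task_fn`, `vote3`, `tmr_run`, `square`, and `faulty_square` are hypothetical and introduced only for illustration, and the sketch assumes a task is a pure function whose result is a single `long`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical task signature (an assumption of this sketch):
   a pure function from one input value to one result value. */
typedef long (*task_fn)(long input);

/* Majority vote over three replica results.
   Returns true and stores the agreed value in *out when at least
   two of the three results match; returns false otherwise. */
bool vote3(long a, long b, long c, long *out) {
    if (a == b || a == c) { *out = a; return true; }
    if (b == c)           { *out = b; return true; }
    return false; /* all three disagree: the error is detected but not masked */
}

/* Run fn(input) as three independent OpenMP tasks, then vote.
   Compiled without OpenMP support, the pragmas are ignored and the
   three replicas simply run one after another. */
bool tmr_run(task_fn fn, long input, long *out) {
    assert(fn != NULL && out != NULL);
    long r[3] = {0, 0, 0};

    #pragma omp parallel
    #pragma omp single
    {
        for (int i = 0; i < 3; i++) {
            #pragma omp task shared(r) firstprivate(i)
            r[i] = fn(input);
        }
        /* Wait for all three replicas before leaving the region. */
        #pragma omp taskwait
    }
    return vote3(r[0], r[1], r[2], out);
}

/* Example task. */
long square(long x) { return x * x; }

/* Hypothetical fault injector: the first call returns a corrupted
   result, simulating a single transient fault in one replica. */
static int calls = 0;
long faulty_square(long x) {
    int n;
    #pragma omp atomic capture
    n = calls++;
    return (n == 0) ? x * x + 1 : x * x;
}
```

Calling `tmr_run(faulty_square, 7, &out)` runs three replicas, one of which returns a corrupted value; the vote still selects the value produced by the two agreeing replicas. Comparing results with `==` is a simplification that suits integer results; tasks with floating-point or structured outputs would need a different comparison.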