REPAS: Reliable Execution for Parallel ApplicationS in Tiled-CMPs

  • Authors:
  • Daniel Sánchez;Juan L. Aragón;José M. García

  • Affiliations:
  • Departamento de Ingeniería y Tecnología de Computadores, Universidad de Murcia, Murcia, (Spain) 30071;Departamento de Ingeniería y Tecnología de Computadores, Universidad de Murcia, Murcia, (Spain) 30071;Departamento de Ingeniería y Tecnología de Computadores, Universidad de Murcia, Murcia, (Spain) 30071

  • Venue:
  • Euro-Par '09 Proceedings of the 15th International Euro-Par Conference on Parallel Processing
  • Year:
  • 2009

Abstract

Reliability has become a first-class design consideration for architects, along with performance and energy efficiency. Continued technology scaling and the accompanying supply voltage reductions, together with temperature fluctuations, increase the susceptibility of architectures to errors. Previous approaches have tried to provide fault tolerance. However, they usually present critical drawbacks in terms of either hardware duplication or performance degradation, which are unacceptable for the majority of common users. RMT (Redundant Multi-Threading) is a family of techniques based on SMT processors in which two independent threads (master and slave), fed with the same inputs, redundantly execute the same instructions in order to detect faults by checking their outputs. In this paper, we study the under-explored architectural support that RMT techniques need to reliably execute shared-memory applications. We show how atomic operations induce serialization points between master and slave threads. This bottleneck has a 34% impact on execution time for several parallel scientific benchmarks. To address this issue, we present REPAS (Reliable execution of Parallel ApplicationS in tiled-CMPs), a novel RMT mechanism that provides reliable execution for shared-memory applications. While previous proposals achieve the same goal by using a large amount of hardware (usually twice the number of cores in the system), the REPAS architecture needs only a small amount of extra hardware, since the redundant execution is performed within 2-way SMT cores in which the majority of the hardware is shared. Our evaluation shows that REPAS provides full coverage against soft errors with a lower performance slowdown, relative to a non-redundant system, than previous proposals, while also using fewer hardware resources.
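
The RMT principle described in the abstract (two threads fed with the same inputs, with faults detected by comparing outputs) can be illustrated with a minimal software sketch. This is only an illustration of the general idea, not the REPAS hardware mechanism: the names `redundant_execute`, `flip_bit`, and `FaultDetected`, and the `fault_bit` parameter used to simulate a soft error, are all hypothetical and do not come from the paper.

```python
from typing import Callable, Optional


class FaultDetected(Exception):
    """Raised when the master and slave results disagree."""


def flip_bit(value: int, bit: int) -> int:
    """Simulate a soft error by flipping one bit of an integer result."""
    return value ^ (1 << bit)


def redundant_execute(
    op: Callable[[int, int], int],
    a: int,
    b: int,
    fault_bit: Optional[int] = None,
) -> int:
    """Run `op` twice on the same inputs and compare the two outputs.

    `fault_bit` is an illustrative knob: when set, one bit of the slave
    result is flipped to mimic a transient fault in the redundant copy.
    """
    master = op(a, b)  # first (master) execution
    slave = op(a, b)   # redundant (slave) execution with the same inputs
    if fault_bit is not None:
        slave = flip_bit(slave, fault_bit)
    if master != slave:
        raise FaultDetected(f"master={master} slave={slave}")
    return master  # outputs agree, so the result can be committed
```

In this sketch a matching pair of results is returned to the caller, while a mismatch raises `FaultDetected`; a real RMT design would perform the comparison in hardware before the results become visible outside the core.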