Applying classic feedback control for enhancing the fault-tolerance of parallel pipeline workflows on multi-core systems

  • Authors:
  • Tudor B. Ionescu;Eckart Laurien;Walter Scheuermann

  • Affiliations:
  • Institute of Nuclear Technology and Energy Systems, Stuttgart, Germany;Institute of Nuclear Technology and Energy Systems, Stuttgart, Germany;Institute of Nuclear Technology and Energy Systems, Stuttgart, Germany

  • Venue:
  • Facing the multicore-challenge
  • Year:
  • 2010

Abstract

Nuclear disaster early warning systems are based on simulations of the atmospheric dispersion of radioactive pollutants that may have been released into the atmosphere as a result of an accident at a nuclear power plant. Currently, the calculation is performed by a chain of 9 sequential simulation codes written in FORTRAN and C/C++. The new requirements for the example early warning system we focus on in this paper include a maximum response time of 120 seconds, whereas computing a single simulation step currently exceeds this limit. To improve performance, we propose a pipeline parallelization of the simulation workflow on a multi-core system. This leads to a 4.5x speedup with respect to the sequential execution time on a dual quad-core machine. The scheduling problem that arises is that of maximizing the number of iterations of the dispersion calculation algorithm while not exceeding the maximum response time limit. In the context of our example application, a static scheduling strategy (e.g., a fixed rate of firing iterations) proves to be inappropriate because it cannot tolerate faults that may occur during regular use (e.g., CPU failure, software errors, heavy load bursts). In this paper we show how a simple PI controller is able to keep the realized response time of the workflow around a desired value in different failure and heavy-load scenarios by automatically reducing the throughput of the system when necessary, thus improving the system's fault tolerance.
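The control idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the plant model `response_time`, the gains `kp` and `ki`, the 100-second setpoint (chosen here to sit below the 120-second limit), and the core counts are all hypothetical values picked only to make the behavior visible. The sketch shows a discrete PI controller that sets the rate at which iterations are fired and lowers it after a simulated loss of half the cores.

```python
from dataclasses import dataclass


@dataclass
class PIController:
    """Discrete PI controller; the gains are illustrative, not from the paper."""
    kp: float
    ki: float
    setpoint: float          # desired response time in seconds
    integral: float = 0.0    # running sum of past errors

    def update(self, measured: float) -> float:
        error = self.setpoint - measured
        self.integral += error
        # A negative firing rate is meaningless, so clamp the output at zero.
        return max(0.0, self.kp * error + self.ki * self.integral)


def response_time(rate: float, cores: int) -> float:
    """Hypothetical plant: a fixed 40 s latency plus a load term that
    grows with the firing rate and shrinks with the number of cores."""
    return 40.0 + 80.0 * rate / cores


def simulate(steps: int = 120, fault_at: int = 60):
    """Run the closed loop; half of the 8 cores fail at step `fault_at`."""
    controller = PIController(kp=0.01, ki=0.02, setpoint=100.0)
    rate = 0.0  # iterations fired per minute (assumed unit)
    history = []
    for step in range(steps):
        cores = 8 if step < fault_at else 4
        measured = response_time(rate, cores)
        history.append((step, cores, rate, measured))
        rate = controller.update(measured)
    return history


if __name__ == "__main__":
    for step, cores, rate, measured in simulate():
        if step % 10 == 0 or step in (59, 60, 61):
            print(f"step={step:3d} cores={cores} rate={rate:5.2f} response={measured:6.1f}s")
```

In this toy model the response time briefly overshoots the setpoint when the fault occurs, after which the controller reduces the firing rate until the response time settles back near the desired value. A fixed firing rate would instead leave the response time permanently above the limit after the fault.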