Applying classic feedback control for enhancing the fault-tolerance of parallel pipeline workflows on multi-core systems

Authors:
Tudor B. Ionescu;Eckart Laurien;Walter Scheuermann
Affiliations:
Institute of Nuclear Technology and Energy Systems, Stuttgart, Germany;Institute of Nuclear Technology and Energy Systems, Stuttgart, Germany;Institute of Nuclear Technology and Energy Systems, Stuttgart, Germany
Venue:
Facing the multicore-challenge
Year:
2010

Abstract

Nuclear disaster early warning systems rely on simulations of the atmospheric dispersion of radioactive pollutants that may have been released by an accident at a nuclear power plant. Currently, the calculation is performed by a chain of nine sequential simulation codes written in FORTRAN and C/C++. The new requirements for the example early warning system considered in this paper include a maximum response time of 120 seconds, whereas computing a single simulation step currently exceeds this limit. To improve performance, we propose a pipeline parallelization of the simulation workflow on a multi-core system. This yields a 4.5x speedup over the sequential execution time on a dual quad-core machine. The resulting scheduling problem is to maximize the number of iterations of the dispersion calculation algorithm without exceeding the maximum response time. In the context of our example application, a static scheduling strategy (e.g., firing iterations at a fixed rate) proves inappropriate because it cannot tolerate faults that may occur during regular use (e.g., CPU failure, software errors, heavy load bursts). In this paper we show how a simple PI controller keeps the realized response time of the workflow close to a desired value in different failure and heavy-load scenarios by automatically reducing the throughput of the system when necessary, thus improving the system's fault tolerance.
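
The feedback idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The class `PIController`, the function names, the gains, the 100-second setpoint, and the linear response-time model are all assumptions chosen for illustration; only the 120-second limit and the idea of lowering throughput under faults come from the abstract. The controller is written in incremental (velocity) form, so clamping the firing rate also prevents integral windup.

```python
from dataclasses import dataclass


@dataclass
class PIController:
    """Incremental PI controller that adjusts the iteration firing rate.

    All names and default values here are hypothetical, not from the paper.
    """
    kp: float          # proportional gain (assumed)
    ki: float          # integral gain (assumed)
    setpoint: float    # desired response time in seconds (assumed)
    rate: float        # current firing rate in iterations per minute
    min_rate: float    # lower bound on the firing rate
    max_rate: float    # upper bound on the firing rate
    prev_error: float = 0.0

    def update(self, measured: float, dt: float) -> float:
        """Return the new firing rate given the measured response time."""
        error = self.setpoint - measured
        # Velocity form: change in output = P part on the error change
        # plus I part on the current error.
        self.rate += self.kp * (error - self.prev_error) + self.ki * error * dt
        # Clamp the actuator; the stored rate is the clamped one, so no windup.
        self.rate = min(self.max_rate, max(self.min_rate, self.rate))
        self.prev_error = error
        return self.rate


def response_time(rate: float, cost_per_iteration: float, base: float = 60.0) -> float:
    """Crude toy plant (assumption): response time grows linearly with load."""
    return base + cost_per_iteration * rate


def simulate(steps: int = 200, fault_at: int = 100):
    """Run the closed loop; a fault doubles the per-iteration cost at fault_at."""
    ctrl = PIController(kp=0.001, ki=0.002, setpoint=100.0,
                        rate=0.1, min_rate=0.05, max_rate=1.0)
    history = []
    for k in range(steps):
        cost = 100.0 if k < fault_at else 200.0  # e.g. a CPU failure (assumed)
        measured = response_time(ctrl.rate, cost)
        history.append((ctrl.rate, measured))
        ctrl.update(measured, dt=1.0)
    return history


if __name__ == "__main__":
    for step, (rate, rt) in enumerate(simulate()):
        if step % 20 == 0 or step in (100, 101):
            print(f"step {step:3d}  rate {rate:.3f} it/min  response {rt:6.1f} s")
```

In this toy model the response time jumps above the 120-second limit in the step right after the fault, and the controller then lowers the firing rate until the response time returns to the setpoint. A fixed firing rate would stay above the limit for as long as the fault persists. How large the transient is in a real pipeline depends on the plant dynamics and the gains, which this sketch does not attempt to reproduce.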