A work-stealing scheduling framework supporting fault tolerance

  • Authors:
  • Yizhuo Wang; Weixing Ji; Feng Shi; Qi Zuo

  • Affiliations:
  • Beijing Institute of Technology, Beijing, China; Beijing Institute of Technology, Beijing, China; Beijing Institute of Technology, Beijing, China; Beijing Institute of Technology, Beijing, China

  • Venue:
  • Proceedings of the Conference on Design, Automation and Test in Europe
  • Year:
  • 2013

Abstract

Fault tolerance and load balancing are both critical for executing long-running parallel applications on multicore clusters. This paper addresses the two together by presenting a novel work-stealing task scheduling framework that supports hardware fault tolerance. In this framework, detection of and recovery from both transient and permanent faults happen at task granularity. We build the framework by incorporating task-based fault detection and recovery mechanisms into a hierarchical work-stealing scheme. By fully exploiting task parallelism, the framework provides low-overhead fault tolerance and optimal load balancing.
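
The abstract does not spell out the mechanisms, so the following is only an illustrative sketch of what task-granularity recovery inside a work-stealing scheduler could look like, not the authors' implementation. Every name in it (TaskFault, Worker, run_with_recovery, schedule, flaky) is hypothetical. It assumes that a detected transient fault surfaces as an exception and is handled by re-executing just that task, and that a permanent fault is modeled as a worker marked dead whose queued tasks are handed to surviving workers. The stealing here is flat and simulated in a single thread, whereas the paper describes a hierarchical scheme on a cluster.

```python
from collections import deque
import random


class TaskFault(Exception):
    """Hypothetical signal that a fault was detected while running a task."""


class Worker:
    def __init__(self, wid):
        self.wid = wid
        self.tasks = deque()  # owner uses the right end, thieves take from the left
        self.alive = True     # False models a permanent fault (assumption)


def run_with_recovery(task, max_retries=3):
    """Re-execute only the faulty task; no global rollback is needed."""
    for _ in range(max_retries + 1):
        try:
            return task()
        except TaskFault:
            continue  # transient fault detected: rerun this task
    raise RuntimeError("task failed repeatedly; fault may be permanent")


def schedule(workers, rng=None):
    """Run all queued tasks to completion and return their results."""
    rng = rng or random.Random(0)
    results = []
    while True:
        live = [w for w in workers if w.alive]
        # Permanent-fault recovery: move a dead worker's tasks to survivors.
        for w in workers:
            if not w.alive and w.tasks:
                if not live:
                    raise RuntimeError("no live workers left")
                while w.tasks:
                    rng.choice(live).tasks.append(w.tasks.popleft())
        if not any(w.tasks for w in live):
            return results
        for w in live:
            if w.tasks:
                task = w.tasks.pop()  # owner takes its newest task
            else:
                victims = [v for v in live if v is not w and v.tasks]
                if not victims:
                    continue
                task = rng.choice(victims).tasks.popleft()  # steal oldest task
            results.append(run_with_recovery(task))


def flaky(value, failures):
    """Build a task that raises TaskFault `failures` times, then returns value."""
    state = {"left": failures}

    def task():
        if state["left"] > 0:
            state["left"] -= 1
            raise TaskFault()
        return value

    return task
```

In this sketch the unit of recovery is a single task: a transient fault costs one extra execution of that task, and a permanent fault costs only the redistribution of tasks that had not yet run. Idle workers steal from the opposite end of a victim's deque, which is the usual work-stealing arrangement for keeping load balanced.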