The data locality of work stealing

  • Authors:
  • Umut A. Acar, Guy E. Blelloch, Robert D. Blumofe

  • Affiliations:
  • School of Computer Science, Carnegie Mellon University (Acar, Blelloch); Department of Computer Sciences, University of Texas at Austin (Blumofe)

  • Venue:
  • Proceedings of the Twelfth Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA '00)
  • Year:
  • 2000

Abstract

This paper studies the data locality of the work-stealing scheduling algorithm on hardware-controlled shared-memory machines. We present lower and upper bounds on the number of cache misses using work stealing, and introduce a locality-guided work-stealing algorithm along with experimental validation.

As a lower bound, we show that there is a family of multithreaded computations G_n, each member of which requires Θ(n) total instructions (work), for which the number of cache misses under work stealing is constant on one processor, while even on two processors the total number of cache misses is Ω(n). This implies that for general computations there is no useful bound relating multiprocessor to uniprocessor cache misses. For nested-parallel computations, however, we show that on P processors the expected number of additional cache misses beyond those incurred on a single processor is bounded by O(C⌈m/s⌉PT∞), where m is the execution time of an instruction that incurs a cache miss, s is the steal time, C is the size of the cache, and T∞ is the number of nodes on the longest chain of dependences. Based on this we give strong bounds on the total running time of nested-parallel computations using work stealing.

For the second part of our results, we present a locality-guided work-stealing algorithm that improves the data locality of multithreaded computations by allowing a thread to have an affinity for a processor. Our initial experiments on iterative data-parallel applications show that the algorithm matches the performance of static partitioning under traditional workloads but improves performance by up to 50% over static partitioning under multiprogrammed workloads. Furthermore, locality-guided work stealing improves the performance of plain work stealing by up to 80%.
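To make the affinity idea in the abstract concrete, the sketch below is a minimal, single-threaded simulation of a locality-guided work-stealing scheduler: each worker owns a deque of tasks plus a "mailbox" of tasks that have affinity for it, and an idle worker checks its mailbox first, then pops its own deque, then steals from a victim. The `Worker`/`Scheduler` names, the task-as-dict representation, and the mailbox-based design are illustrative assumptions, not the authors' implementation.

```python
from collections import deque
import random

class Worker:
    def __init__(self, wid):
        self.wid = wid
        self.deque = deque()    # own work; bottom is the right end
        self.mailbox = deque()  # tasks with affinity for this worker

class Scheduler:
    """Toy locality-guided work stealing (simulated, not actually parallel)."""

    def __init__(self, nworkers):
        self.workers = [Worker(i) for i in range(nworkers)]

    def spawn(self, task, by, affinity=None):
        # A task with an affinity is additionally placed in the mailbox of
        # its preferred worker; it remains in the spawner's deque so it is
        # never lost if the preferred worker stays busy.
        if affinity is not None:
            self.workers[affinity].mailbox.append(task)
        self.workers[by].deque.append(task)

    def next_task(self, wid):
        w = self.workers[wid]
        # 1) Prefer mailbox tasks (likely warm in this worker's cache),
        #    skipping any copy that was already executed elsewhere.
        while w.mailbox:
            t = w.mailbox.popleft()
            if not t.get('done'):
                return t
        # 2) Pop the bottom of the worker's own deque (LIFO order, as in
        #    ordinary work stealing).
        while w.deque:
            t = w.deque.pop()
            if not t.get('done'):
                return t
        # 3) Steal from the top (FIFO end) of a random victim's deque.
        victims = [v for v in self.workers if v.wid != wid and v.deque]
        random.shuffle(victims)
        for v in victims:
            while v.deque:
                t = v.deque.popleft()
                if not t.get('done'):
                    return t
        return None
```

Because a task can sit in both a mailbox and a deque, each task carries a `done` flag and stale copies are skipped; a real scheduler would resolve this race with an atomic claim instead. The fallback chain (mailbox, own deque, steal) is what lets the affinity hint improve locality without giving up work stealing's load balancing.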