Cache-aware partitioning of multi-dimensional iteration spaces

  • Authors:
  • Arun Kejariwal; Alexandru Nicolau; Utpal Banerjee; Alexander V. Veidenbaum; Constantine D. Polychronopoulos

  • Affiliations:
  • Yahoo! Inc., Santa Clara, CA; University of California, Irvine, CA; University of California, Irvine, CA; University of California, Irvine, CA; University of Illinois at Urbana-Champaign, Urbana, IL

  • Venue:
  • SYSTOR '09 Proceedings of SYSTOR 2009: The Israeli Experimental Systems Conference
  • Year:
  • 2009

Abstract

The need for high performance per watt has led to the development of multi-core systems such as the Intel Core 2 Duo processor and the Intel quad-core Kentsfield processor. Maximal exploitation of the hardware parallelism supported by such systems necessitates the development of concurrent software. This, in part, entails automatic parallelization of programs and efficient mapping of the parallelized program onto the different cores. The latter affects the load balance across the cores, which in turn has a direct impact on performance. Because parallel loops, such as a parallel DO loop in Fortran, account for a large percentage of the total execution time, we focus on the problem of how to efficiently partition the iteration space of (possibly nested, perfect or non-perfect) parallel loops. A key aspect of this problem is how to efficiently capture cache behavior, as the cache subsystem is often the main performance bottleneck in multi-core systems. In this paper, we present a novel profile-guided compiler technique for cache-aware scheduling of the iteration spaces of such loops. Specifically, we propose a technique for iteration space scheduling which captures the effect of variation in the number of cache misses across the iteration space. Subsequently, we propose a general approach to capture the variation of both the number of cache misses and computation across the iteration space. We demonstrate the efficacy of our approach on a dedicated 4-way Intel® Xeon®-based multiprocessor using several kernels from the industry-standard SPEC CPU2000 and CPU2006 benchmarks, achieving speedups of up to 62.5%.
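
To make the idea of cache-aware partitioning concrete, the following is a minimal sketch and not the authors' algorithm, which the abstract does not specify. It assumes a one-dimensional (flattened) iteration space, a hypothetical per-iteration profile of compute cycles and cache misses, and a fixed, made-up miss penalty. The names `iteration_costs`, `partition_by_cost`, and `miss_penalty` are illustrative only. Each core receives a contiguous chunk of iterations whose estimated cost is as close as possible to an equal share of the total, rather than an equal number of iterations.

```python
from bisect import bisect_left
from itertools import accumulate


def iteration_costs(compute_cycles, cache_misses, miss_penalty):
    """Estimated cost per iteration: compute cycles plus a stall term.

    miss_penalty (cycles per miss) is an assumed constant, not a value
    taken from the paper.
    """
    return [c + miss_penalty * m for c, m in zip(compute_cycles, cache_misses)]


def partition_by_cost(costs, num_cores):
    """Split iterations into num_cores contiguous half-open ranges.

    Cut points are placed where the running cost is nearest to
    k/num_cores of the total, so chunks are balanced by estimated cost
    instead of by iteration count. Costs must be non-negative.
    """
    # prefix[i] is the total cost of iterations [0, i)
    prefix = [0] + list(accumulate(costs))
    total = prefix[-1]
    bounds = [0]
    for k in range(1, num_cores):
        target = total * k / num_cores
        cut = bisect_left(prefix, target)
        # Step back one iteration if that lands closer to the target
        if cut > 0 and target - prefix[cut - 1] <= prefix[cut] - target:
            cut -= 1
        # Keep cut points non-decreasing so ranges never overlap
        bounds.append(max(cut, bounds[-1]))
    bounds.append(len(costs))
    return [(bounds[i], bounds[i + 1]) for i in range(num_cores)]


# Hypothetical profile: the first 4 of 12 iterations miss the cache heavily.
costs = iteration_costs([10] * 12, [4] * 4 + [0] * 8, miss_penalty=10)
print(partition_by_cost(costs, 2))
```

With this made-up profile, an equal-count split into two halves of six iterations would give the first core an estimated 220 cycles and the second 60, whereas the cost-based split assigns 150 and 130. Real profiles, multi-dimensional iteration spaces, and non-perfect nests would need a richer cost model than this sketch provides.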