Cache-aware partitioning of multi-dimensional iteration spaces

Authors:
Arun Kejariwal;Alexandru Nicolau;Utpal Banerjee;Alexander V. Veidenbaum;Constantine D. Polychronopoulos
Affiliations:
Yahoo! Inc., Santa Clara, CA;University of California, Irvine, CA;University of California, Irvine, CA;University of California, Irvine, CA;University of Illinois at Urbana-Champaign, Urbana, IL
Venue:
SYSTOR '09 Proceedings of SYSTOR 2009: The Israeli Experimental Systems Conference
Year:
2009




Abstract

The need for high performance per watt has led to the development of multi-core systems such as the Intel Core 2 Duo processor and the Intel quad-core Kentsfield processor. Maximal exploitation of the hardware parallelism supported by such systems necessitates the development of concurrent software. This, in part, entails automatic parallelization of programs and efficient mapping of the parallelized program onto the different cores. The mapping affects the load balance between the cores, which in turn has a direct impact on performance. Because parallel loops, such as a parallel DO loop in Fortran, account for a large percentage of the total execution time, we focus on the problem of how to efficiently partition the iteration space of (possibly nested, perfect or non-perfect) parallel loops. In this regard, one of the key aspects is how to efficiently capture the cache behavior, as the cache subsystem is often the main performance bottleneck in multi-core systems. In this paper, we present a novel profile-guided compiler technique for cache-aware scheduling of the iteration spaces of such loops. Specifically, we propose a technique for iteration space scheduling which captures the effect of variation in the number of cache misses across the iteration space. Subsequently, we propose a general approach to capture the variation of both the number of cache misses and the amount of computation across the iteration space. We demonstrate the efficacy of our approach on a dedicated 4-way Intel® Xeon® based multiprocessor using several kernels from the industry-standard SPEC CPU2000 and CPU2006 benchmarks, achieving speedups of up to 62.5%.
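
The general idea of balancing both cache misses and computation across cores can be illustrated with a minimal sketch. It is not the algorithm from the paper: it handles only the outermost loop (a one-dimensional slice of the iteration space), and it assumes that a profiling run has already produced per-iteration compute cycles and cache-miss counts. The function name `partition_by_cost`, its parameters, and the flat `miss_penalty` value are all hypothetical choices made for illustration.

```python
from bisect import bisect_left
from itertools import accumulate


def partition_by_cost(compute_cycles, cache_misses, num_cores, miss_penalty=100):
    """Split iterations 0..n-1 into num_cores contiguous chunks of similar cost.

    Assumed cost model (illustrative only):
        cost(i) = compute_cycles[i] + miss_penalty * cache_misses[i]
    Returns a list of half-open (start, end) iteration ranges, one per core.
    """
    if num_cores < 1:
        raise ValueError("num_cores must be at least 1")
    if len(compute_cycles) != len(cache_misses):
        raise ValueError("profile arrays must have the same length")

    n = len(compute_cycles)
    costs = [c + miss_penalty * m for c, m in zip(compute_cycles, cache_misses)]
    prefix = list(accumulate(costs))  # prefix[i] = total cost of iterations 0..i
    total = prefix[-1] if prefix else 0

    bounds = [0]
    for k in range(1, num_cores):
        # Ideal cumulative cost at the end of the k-th chunk.
        target = total * k / num_cores
        # Cut just after the first iteration whose cumulative cost reaches it.
        cut = bisect_left(prefix, target) + 1
        # Keep boundaries monotonic and inside the iteration space.
        cut = min(max(cut, bounds[-1]), n)
        bounds.append(cut)
    bounds.append(n)

    return [(bounds[i], bounds[i + 1]) for i in range(num_cores)]


# Eight iterations with equal computation; the last four miss in the cache.
chunks = partition_by_cost([10] * 8, [0, 0, 0, 0, 4, 4, 4, 4], 2, miss_penalty=10)
print(chunks)  # [(0, 6), (6, 8)]
```

In this example an equal-count split (four iterations per core) would give the two cores costs of 40 and 200 under the assumed model, whereas the cost-weighted split gives 140 and 100. A real implementation would also need to handle nested and non-rectangular iteration spaces, which this sketch does not attempt.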