The need for high performance per watt has led to the development of multi-core systems such as the Intel Core 2 Duo processor and the Intel quad-core Kentsfield processor. Fully exploiting the hardware parallelism of such systems requires concurrent software. This, in part, entails parallelizing the program and efficiently mapping the parallelized program onto the cores. The mapping determines the load balance across the cores, which in turn has a direct impact on performance. Because parallel loops, such as a parallel DO loop in Fortran, account for a large percentage of total execution time, we focus on how to efficiently partition the iteration space of (possibly) nested, perfect or non-perfect, parallel loops. A key aspect of this problem is capturing cache behavior efficiently, as the cache subsystem is often the main performance bottleneck in multi-core systems. In this paper, we present a novel profile-guided compiler technique for cache-aware partitioning of the iteration spaces of parallel loops. We present a case study using a kernel from the industry-standard SPEC CPU benchmark suite.
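To make the load-balance point concrete, the following is a minimal sketch that partitions the outer loop of a triangular loop nest across cores in two ways: equal-width chunks and equal-work chunks. It is not the profile-guided, cache-aware technique of the paper. It balances iteration counts only and ignores cache behavior entirely. The triangular nest and all function names (`row_work`, `equal_width_bounds`, `equal_work_bounds`, `imbalance`) are assumptions chosen for illustration.

```python
from bisect import bisect_left
from itertools import accumulate


def row_work(i):
    """Inner-loop trip count of outer iteration i in a hypothetical
    triangular nest: for i in range(n): for j in range(i + 1): ..."""
    return i + 1


def equal_width_bounds(n, cores):
    """Split the outer loop [0, n) into contiguous chunks of equal width."""
    return [k * n // cores for k in range(cores + 1)]


def equal_work_bounds(n, cores):
    """Split [0, n) into contiguous chunks of roughly equal iteration count."""
    # prefix[i] holds the total work of outer iterations 0..i inclusive.
    prefix = list(accumulate(row_work(i) for i in range(n)))
    total = prefix[-1]
    bounds = [0]
    for k in range(1, cores):
        target = k * total / cores
        # Cut just after the first row whose cumulative work reaches the target.
        bounds.append(bisect_left(prefix, target) + 1)
    bounds.append(n)
    return bounds


def chunk_loads(bounds):
    """Iterations executed by each chunk [lo, hi) described by bounds."""
    return [sum(row_work(i) for i in range(lo, hi))
            for lo, hi in zip(bounds, bounds[1:])]


def imbalance(loads):
    """Ratio of the heaviest chunk to the mean chunk (1.0 is perfectly balanced)."""
    return max(loads) / (sum(loads) / len(loads))


if __name__ == "__main__":
    n, cores = 1000, 4
    for name, make_bounds in (("equal width", equal_width_bounds),
                              ("equal work", equal_work_bounds)):
        bounds = make_bounds(n, cores)
        loads = chunk_loads(bounds)
        print(f"{name}: bounds={bounds} imbalance={imbalance(loads):.3f}")
```

With equal-width chunks, the core that receives the last chunk executes far more inner iterations than the first, so it dominates the running time. Equalizing iteration counts removes that skew, but two chunks with the same iteration count can still differ in cache misses, which is the gap a cache-aware partitioning aims to close.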