Advanced compiler optimizations for supercomputers
Communications of the ACM - Special issue on parallelism
ACM SIGARCH Computer Architecture News - Special Issue: Architectural Support for Operating Systems
Computer architecture: a quantitative approach
Computer architecture: a quantitative approach
Loop partitioning for distributed memory multiprocessors as unimodular transformations
ICS '91 Proceedings of the 5th international conference on Supercomputing
A data locality optimizing algorithm
PLDI '91 Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation
Array privatization for parallel execution of loops
ICS '92 Proceedings of the 6th international conference on Supercomputing
PLDI '93 Proceedings of the ACM SIGPLAN 1993 conference on Programming language design and implementation
Improving semi-static branch prediction by code replication
PLDI '94 Proceedings of the ACM SIGPLAN 1994 conference on Programming language design and implementation
SIGMETRICS '94 Proceedings of the 1994 ACM SIGMETRICS conference on Measurement and modeling of computer systems
Partitioning and mapping of nested loops for linear array multicomputers
The Journal of Supercomputing - Special issue: trends in parallel operating systems
Symbolic analysis for parallelizing compilers
ACM Transactions on Programming Languages and Systems (TOPLAS)
Combining loop transformations considering caches and scheduling
Proceedings of the 29th annual ACM/IEEE international symposium on Microarchitecture
Exploiting hardware performance counters with flow and context sensitive profiling
Proceedings of the ACM SIGPLAN 1997 conference on Programming language design and implementation
Cache miss equations: an analytical representation of cache misses
ICS '97 Proceedings of the 11th international conference on Supercomputing
Analytical Modeling of Set-Associative Cache Behavior
IEEE Transactions on Computers
Optimizing cache miss equations polyhedra
ACM SIGARCH Computer Architecture News - Special issue on interaction between compilers and computer architectures
Exact analysis of the cache behavior of nested loops
Proceedings of the ACM SIGPLAN 2001 conference on Programming language design and implementation
Optimizing Supercompilers for Supercomputers
Optimizing Supercompilers for Supercomputers
High Performance Compilers for Parallel Computing
High Performance Compilers for Parallel Computing
A Loop Transformation Theory and an Algorithm to Maximize Parallelism
IEEE Transactions on Parallel and Distributed Systems
Partitioning and Labeling of Loops by Unimodular Transformations
IEEE Transactions on Parallel and Distributed Systems
On Estimating and Enhancing Cache Effectiveness
Proceedings of the Fourth International Workshop on Languages and Compilers for Parallel Computing
Parallel Program Graphs and their Classification
Proceedings of the 6th International Workshop on Languages and Compilers for Parallel Computing
Evaluation of Loop Grouping Methods Based on Orthogonal Projection Spaces
ICPP '00 Proceedings of the Proceedings of the 2000 International Conference on Parallel Processing
ACM SIGARCH Computer Architecture News
A compiler tool to predict memory hierarchy performance of scientific codes
Parallel Computing
A novel approach for partitioning iteration spaces with variable densities
Proceedings of the tenth ACM SIGPLAN symposium on Principles and practice of parallel programming
Queue - Multiprocessors
Software and the Concurrency Revolution
Queue - Multiprocessors
A general approach for partitioning N-dimensional parallel nested loops with conditionals
Proceedings of the eighteenth annual ACM symposium on Parallelism in algorithms and architectures
Online optimizations driven by hardware performance monitoring
Proceedings of the 2007 ACM SIGPLAN conference on Programming language design and implementation
Nahalal: Cache Organization for Chip Multiprocessors
IEEE Computer Architecture Letters
Using platform-specific performance counters for dynamic compilation
LCPC'05 Proceedings of the 18th international conference on Languages and Compilers for Parallel Computing
A geometric approach for partitioning n-dimensional non-rectangular iteration spaces
LCPC'04 Proceedings of the 17th international conference on Languages and Compilers for High Performance Computing
On-chip cache hierarchy-aware tile scheduling for multicore machines
CGO '11 Proceedings of the 9th Annual IEEE/ACM International Symposium on Code Generation and Optimization
Hi-index | 0.00 |
The need for high performance per watt has led to development of multi-core systems such as the Intel Core 2 Duo processor and the Intel quad-core Kentsfield processor. Maximal exploitation of the hardware parallelism supported by such systems necessitates the development of concurrent software. This, in part, entails automatic parallelization of programs and efficient mapping of the parallelized program onto the different cores. The latter affects the load balance between the different cores which in turn has a direct impact on performance. In light of the fact that, parallel loops, such as a parallel DO loop in Fortran, account for a large percentage of the total execution time, we focus on the problem of how to efficiently partition the iteration space of (possibly) nested perfect/non-perfect parallel loops. In this regard, one of the key aspects is how to efficiently capture the cache behavior as the cache subsystem is often the main performance bottleneck in multi-core systems. In this paper, we present a novel profile-guided compiler technique for cache-aware scheduling of iteration spaces of such loops. Specifically, we propose a technique for iteration space scheduling which captures the effect of variation in the number of cache misses across the iteration space. Subsequently, we propose a general approach to capture the variation of both the number of cache misses and computation across the iteration space. We demonstrate the efficacy of our approach on a dedicated 4-way Intel® Xeon® based multiprocessor using several kernels from the industry-standard SPEC CPU2000 and CPU2006 benchmarks achieving speedups upto 62.5%.