Iteration space tiling and scheduling are important techniques for optimizing loops, which account for a large fraction of execution time in the computation kernels of both scientific codes and embedded applications. While tiling has been studied extensively for both uniprocessor and multiprocessor platforms, prior research has paid less attention to tile scheduling, especially on multicore machines with deep on-chip cache hierarchies. In this paper, we propose a cache hierarchy-aware tile scheduling algorithm for multicore machines that maximizes both horizontal and vertical data reuse in on-chip caches while balancing the workload across cores. This scheduling algorithm is a key component of a source-to-source translation tool we developed for automatic loop parallelization and multithreaded code generation from sequential code. To the best of our knowledge, this is the first fully automated tile scheduling strategy customized for the on-chip cache topologies of multicore machines. Experimental results collected by executing twelve application programs on three commercial Intel machines (Nehalem, Dunnington, and Harpertown) show that our cache-aware tile scheduling achieves a 27.9% reduction in cache misses and, on average, a 13.5% improvement in execution time over an alternative method.
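To make the idea concrete, the following is a minimal, hypothetical sketch (not the paper's actual algorithm) of cache-topology-aware tile assignment: tiles are handed out in contiguous, balanced chunks, with cores enumerated cache group by cache group, so that neighboring tiles land on cores that share a cache (horizontal reuse) and each core processes consecutive tiles (vertical reuse). The function name `schedule_tiles` and the `cache_groups` representation are illustrative assumptions.

```python
def schedule_tiles(num_tiles, cache_groups):
    """Assign a 1-D sequence of tile indices to cores.

    cache_groups: list of lists of core ids that share a last-level
    cache, e.g. [[0, 1], [2, 3]] for two dual-core clusters.
    Returns {core_id: [tile indices]}. Cores in the same cache group
    receive adjacent chunks (horizontal reuse across a shared cache),
    and each core's chunk is contiguous (vertical reuse within a core),
    with chunk sizes balanced to within one tile.
    """
    # Enumerate cores group by group so adjacent chunks share a cache.
    cores = [core for group in cache_groups for core in group]
    n = len(cores)
    base, extra = divmod(num_tiles, n)  # balanced chunk sizes

    schedule, start = {}, 0
    for i, core in enumerate(cores):
        count = base + (1 if i < extra else 0)
        schedule[core] = list(range(start, start + count))
        start += count
    return schedule


# Example: 10 tiles, two dual-core clusters.
print(schedule_tiles(10, [[0, 1], [2, 3]]))
# → {0: [0, 1, 2], 1: [3, 4, 5], 2: [6, 7], 3: [8, 9]}
```

In this toy assignment, tiles 0–5 stay within the first cluster's shared cache while cores 2 and 3 work on the disjoint range 6–9, which is the intuition behind grouping tiles by cache topology; a real implementation would also weigh inter-tile dependences and per-tile data footprints.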