Iteration space tiling and scheduling are important techniques for optimizing loops, which account for a large fraction of execution time in the computation kernels of both scientific codes and embedded applications. While tiling has been studied extensively for both uniprocessor and multiprocessor platforms, prior research has paid less attention to tile scheduling, especially on multicore machines with deep on-chip cache hierarchies. In this paper, we propose a cache hierarchy-aware tile scheduling algorithm for multicore machines that maximizes both horizontal and vertical data reuse in on-chip caches while balancing the workload across cores. This scheduling algorithm is a key component of a source-to-source translation tool we developed for automatic loop parallelization and multithreaded code generation from sequential code. To the best of our knowledge, this is the first fully automated tile scheduling strategy customized for the on-chip cache topologies of multicore machines. Experimental results collected by executing twelve application programs on three commercial Intel machines (Nehalem, Dunnington, and Harpertown) show that our cache-aware tile scheduling achieves a 27.9% reduction in cache misses and, on average, a 13.5% improvement in execution time over an alternative method.
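To make the idea concrete, the following is a minimal, hypothetical sketch (not the paper's actual algorithm) of cache-topology-aware tile assignment: tiles are handed out in contiguous, balanced chunks, with cores enumerated cache group by cache group, so that neighboring tiles land on cores that share a cache (horizontal reuse) and each core processes consecutive tiles (vertical reuse). The function name `schedule_tiles` and the `cache_groups` representation are illustrative assumptions.

```python
def schedule_tiles(num_tiles, cache_groups):
    """Assign a 1-D sequence of tile indices to cores.

    cache_groups: list of lists of core ids that share a last-level
    cache, e.g. [[0, 1], [2, 3]] for two dual-core clusters.
    Returns {core_id: [tile indices]}. Cores in the same cache group
    receive adjacent chunks (horizontal reuse across a shared cache),
    and each core's chunk is contiguous (vertical reuse within a core),
    with chunk sizes balanced to within one tile.
    """
    # Enumerate cores group by group so adjacent chunks share a cache.
    cores = [core for group in cache_groups for core in group]
    n = len(cores)
    base, extra = divmod(num_tiles, n)  # balanced chunk sizes

    schedule, start = {}, 0
    for i, core in enumerate(cores):
        count = base + (1 if i < extra else 0)
        schedule[core] = list(range(start, start + count))
        start += count
    return schedule


# Example: 10 tiles, two dual-core clusters.
print(schedule_tiles(10, [[0, 1], [2, 3]]))
# → {0: [0, 1, 2], 1: [3, 4, 5], 2: [6, 7], 3: [8, 9]}
```

In this toy assignment, tiles 0–5 stay within the first cluster's shared cache while cores 2 and 3 work on the disjoint range 6–9, which is the intuition behind grouping tiles by cache topology; a real implementation would also weigh inter-tile dependences and per-tile data footprints.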