Improving last level cache locality by integrating loop and data transformations

Authors:
Wei Ding; Mahmut Kandemir
Affiliations:
The Pennsylvania State University, University Park; The Pennsylvania State University, University Park
Venue:
Proceedings of the International Conference on Computer-Aided Design
Year:
2012

Abstract

Most existing data locality optimizations do not specifically target the shared last-level caches of emerging multicores, and even multicore-specific locality techniques employ either loop or data layout optimizations, but not both. Motivated by these observations, we present an integrated loop and data layout optimization strategy whose goal is to improve the last-level cache performance of multicores executing multithreaded applications. We give a detailed mathematical formulation of the strategy and report experimental data from our current implementation. Results collected using 14 application programs show that the integrated approach is effective in practice and outperforms alternatives based on pure loop optimization or pure data layout optimization. The results also indicate that the savings grow with core count and with data set size.
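To make concrete why combining the two kinds of transformation can help where either alone falls short, here is a minimal sketch. It is not the paper's formulation. It is a toy model under stated assumptions: two hypothetical loop nests over `(i, j)`, two hypothetical square arrays `A` and `B`, loop interchange as the only loop transformation, row-major versus column-major as the only layout choice, and "number of references with non-unit innermost stride" as a crude stand-in for poor cache locality. Dependence legality, tiling, and the multicore cache topology are all ignored.

```python
from itertools import product

N = 1024  # hypothetical extent of each square array (assumption)

# Two hypothetical loop nests; each lists (array, subscript order) references.
# Nest 0 touches A[i][j] and B[j][i]; nest 1 touches A[j][i] and B[i][j].
NESTS = [
    [("A", ("i", "j")), ("B", ("j", "i"))],
    [("A", ("j", "i")), ("B", ("i", "j"))],
]
ARRAYS = ["A", "B"]
ORDERS = [("i", "j"), ("j", "i")]      # original order and its interchange
LAYOUTS = ["row_major", "col_major"]   # one layout per array, program-wide


def stride(subscripts, layout, inner):
    """Element distance between accesses in consecutive innermost iterations."""
    coeff = (N, 1) if layout == "row_major" else (1, N)
    return sum(c for c, var in zip(coeff, subscripts) if var == inner)


def bad_refs(orders, layouts):
    """Count references whose innermost stride is not 1 (toy locality cost)."""
    count = 0
    for nest, order in zip(NESTS, orders):
        inner = order[-1]  # last loop in the order is the innermost one
        for array, subs in nest:
            if stride(subs, layouts[array], inner) != 1:
                count += 1
    return count


def best(order_choices, layout_choices):
    """Exhaustively search the allowed loop orders and layouts."""
    return min(
        bad_refs(orders, dict(zip(ARRAYS, layouts)))
        for orders in product(order_choices, repeat=len(NESTS))
        for layouts in product(layout_choices, repeat=len(ARRAYS))
    )


loop_only = best(ORDERS, ["row_major"])    # interchange only, layouts fixed
layout_only = best([("i", "j")], LAYOUTS)  # layouts only, loop order fixed
integrated = best(ORDERS, LAYOUTS)         # both degrees of freedom
```

In this toy instance, `loop_only` and `layout_only` each leave two strided references, while `integrated` reaches zero: a layout is shared by every nest that touches the array, whereas a loop order is local to one nest, so the two choices constrain each other and are best made together.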