A data layout optimization framework for NUCA-based multicores

Authors:
Yuanrui Zhang;Wei Ding;Mahmut Kandemir;Jun Liu;Ohyoung Jang
Affiliations:
The Pennsylvania State University, University Park, PA;The Pennsylvania State University, University Park, PA;The Pennsylvania State University, University Park, PA;The Pennsylvania State University, University Park, PA;The Pennsylvania State University, University Park, PA
Venue:
Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
Year:
2011

Citing 35
Cited 3

A graph-oriented mapping strategy for a hypercube

C3P Proceedings of the third conference on Hypercube concurrent computers and applications: Architecture, software, computer systems, and general issues - Volume 1
A data locality optimizing algorithm

PLDI '91 Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation
PARADIGM: a compiler for automatic data distribution on multicomputers

ICS '93 Proceedings of the 7th international conference on Supercomputing
Data and computation transformations for multiprocessors

PPOPP '95 Proceedings of the fifth ACM SIGPLAN symposium on Principles and practice of parallel programming
Automatic data layout for high performance Fortran

Supercomputing '95 Proceedings of the 1995 ACM/IEEE conference on Supercomputing
A novel approach towards automatic data distribution

Supercomputing '95 Proceedings of the 1995 ACM/IEEE conference on Supercomputing
Improving data locality with loop transformations

ACM Transactions on Programming Languages and Systems (TOPLAS)
Non-singular data transformations: definition, validity and applications

ICS '97 Proceedings of the 11th international conference on Supercomputing
Data transformations for eliminating conflict misses

PLDI '98 Proceedings of the ACM SIGPLAN 1998 conference on Programming language design and implementation
A Linear Algebra Framework for Automatic Determination of Optimal Data Layouts

IEEE Transactions on Parallel and Distributed Systems
Improving cache performance in dynamic applications through data and computation reorganization at run time

Proceedings of the ACM SIGPLAN 1999 conference on Programming language design and implementation
Nonlinear array layouts for hierarchical memory systems

ICS '99 Proceedings of the 13th international conference on Supercomputing
Transformations for imperfectly nested loops

Supercomputing '96 Proceedings of the 1996 ACM/IEEE conference on Supercomputing
Route packets, not wires: on-chip inteconnection networks

Proceedings of the 38th annual Design Automation Conference
An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches

Proceedings of the 10th international conference on Architectural support for programming languages and operating systems
Simics: A Full System Simulation Platform

Computer
A Loop Transformation Algorithm Based on Explicit Data Layout Representation for Optimizing Locality

LCPC '98 Proceedings of the 11th International Workshop on Languages and Compilers for Parallel Computing
Automatic Support for Data Distribution on Distributed Memory Multiprocessor Systems

Proceedings of the 6th International Workshop on Languages and Compilers for Parallel Computing
SPEComp: A New Benchmark Suite for Measuring Parallel Computer Performance

WOMPAT '01 Proceedings of the International Workshop on OpenMP Applications and Tools: OpenMP Shared Memory Parallel Programming
Distance Associativity for High-Performance Energy-Efficient Non-Uniform Cache Architectures

Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation

Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization
Custom Data Layout for Memory Parallelism

Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization
Multi-objective mapping for mesh-based NoC architectures

Proceedings of the 2nd IEEE/ACM/IFIP international conference on Hardware/software codesign and system synthesis
Cache Conscious Data Layout Organization for Conflict Miss Reduction in Embedded Multimedia Applications

IEEE Transactions on Computers
Managing Wire Delay in Large Chip-Multiprocessor Caches

Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture
Multifacet's general execution-driven multiprocessor simulator (GEMS) toolset

ACM SIGARCH Computer Architecture News - Special issue: dasCMP'05
Design and Management of 3D Chip Multiprocessors Using Network-in-Memory

Proceedings of the 33rd annual international symposium on Computer Architecture
Managing Distributed, Shared L2 Caches through OS-Level Page Allocation

Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Application mapping for chip multiprocessors

Proceedings of the 45th annual Design Automation Conference
Comparison of memory write policies for NoC based multicore cache coherent systems

Proceedings of the conference on Design, automation and test in Europe
Integrated code and data placement in two-dimensional mesh based chip multiprocessors

Proceedings of the 2008 IEEE/ACM International Conference on Computer-Aided Design
Reactive NUCA: near-optimal block placement and replication in distributed caches

Proceedings of the 36th annual international symposium on Computer architecture
Data Layout Transformation for Enhancing Data Locality on NUCA Chip Multiprocessors

PACT '09 Proceedings of the 2009 18th International Conference on Parallel Architectures and Compilation Techniques
Automatic transformations for communication-minimized parallelization and locality optimization in the polyhedral model

CC'08/ETAPS'08 Proceedings of the Joint European Conferences on Theory and Practice of Software 17th international conference on Compiler construction
Neighborhood-aware data locality optimization for NoC-based multicores

CGO '11 Proceedings of the 9th Annual IEEE/ACM International Symposium on Code Generation and Optimization

A Case-Based Framework for Interactive Capture and Reuse of Design Knowledge

Applied Intelligence
Off-chip access localization for NoC-based multicores

Proceedings of the 21st international conference on Parallel architectures and compilation techniques
Towards efficient dynamic LLC home bank mapping with noc-level support

Euro-Par'13 Proceedings of the 19th international conference on Parallel Processing

Quantified Score

Hi-index	0.00

Visualization

Abstract

Future multicore architectures are likely to include a large number of cores connected using an on-chip network with Non-uniform Cache Access (NUCA). In such architectures, whether a data request is satisfied from a local cache or a remote cache can make an important difference. To exploit this NUCA property, prior research explored both architectural enhancements as well as compiler-based code optimization strategies. In this work, we take an alternate view, and explore data layout optimizations to improve locality of data accesses in a NUCA-based system. Our proposed approach includes three steps: array tiling, computation-to-core mapping, and layout customization. The first of these tries to identify the affinity between data and computation taking into account parallelization information, with the goal of minimizing remote accesses. The second step maps computations (and their associated data) to cores with the goal of minimizing average distance-to-data, and the last step further customizes the memory layout taking into account the data placement policy adopted by the underlying architecture. We evaluated the success of this three-step approach in enhancing on-chip cache behavior using all application programs from the SPECOMP suite on a full-system simulator. Our results show that the proposed approach improves on average data access latency and execution time by 24.7% and 18.4%, respectively, in the case of static NUCA, and 18.1% and 12.7%, respectively, in the case of dynamic NUCA.