Neighborhood-aware data locality optimization for NoC-based multicores

Authors:
Mahmut Kandemir;Yuanrui Zhang;Jun Liu;Taylan Yemliha
Affiliations:
Department of Computer Science and Engineering, The Pennsylvania State University, University Park, USA;Department of Computer Science and Engineering, The Pennsylvania State University, University Park, USA;Department of Computer Science and Engineering, The Pennsylvania State University, University Park, USA;Department of Computer Science, Syracuse University, USA
Venue:
CGO '11 Proceedings of the 9th Annual IEEE/ACM International Symposium on Code Generation and Optimization
Year:
2011

Abstract

Data locality optimization is a critical issue for NoC (network-on-chip) based multicore systems. In this paper, focusing on a two-dimensional NoC-based multicore and data-intensive multithreaded applications, we first discuss a data-locality-aware scheduling algorithm for any given computation-to-core mapping, and then propose an integrated mapping+scheduling algorithm that performs both tasks together. Both algorithms consider temporal (time-wise) and spatial (neighborhood-aware) data reuse, and try to minimize distance-to-data in on-chip cache accesses. We test the effectiveness of our compiler algorithms using a set of twelve application programs. Our experiments indicate that the proposed algorithms achieve significant improvements in data access latencies (42.7% on average) and overall execution times (24.1% on average). We also conduct a sensitivity analysis in which we vary the number of cores, the on-chip cache capacities, and the data movement (migration) strategies. These experiments show that the proposed algorithms generate consistently good results.
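
As a rough illustration of what "distance-to-data" can mean on a two-dimensional mesh, the sketch below scores a computation-to-core mapping by the total number of hops between each computation and the cache banks it accesses. This is not the paper's algorithm. It assumes that distance is the Manhattan hop count between mesh coordinates (as with dimension-ordered routing), and the names `hop_distance`, `total_distance_to_data`, the task labels, and the access counts are all hypothetical.

```python
from typing import Dict, List, Tuple

Coord = Tuple[int, int]  # (row, column) of a node in the 2D mesh


def hop_distance(a: Coord, b: Coord) -> int:
    """Manhattan hop count between two mesh nodes (assumed distance metric)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def total_distance_to_data(
    mapping: Dict[str, Coord],
    accesses: List[Tuple[str, Coord, int]],
) -> int:
    """Sum of hops over all accesses, weighted by access count.

    mapping  : hypothetical computation-to-core assignment (task -> core coordinate)
    accesses : (task, coordinate of the cache bank holding the data, access count)
    """
    return sum(
        count * hop_distance(mapping[task], bank)
        for task, bank, count in accesses
    )


# Two hypothetical tasks that both read data held in the bank at (0, 1).
accesses = [("t0", (0, 1), 10), ("t1", (0, 1), 5)]

near = {"t0": (0, 0), "t1": (3, 3)}  # heavy user t0 sits next to the data
far = {"t0": (3, 3), "t1": (0, 0)}   # heavy user t0 sits far from the data

print(total_distance_to_data(near, accesses))
print(total_distance_to_data(far, accesses))
```

Under these assumptions the mapping that places the heavier user of the data closer to its cache bank yields the smaller score, which is the kind of quantity a neighborhood-aware mapping would try to reduce.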