Cache topology aware computation mapping for multicores

Authors:
Mahmut Kandemir;Taylan Yemliha;SaiPrashanth Muralidhara;Shekhar Srikantaiah;Mary Jane Irwin;Yuanrui Zhang
Affiliations:
The Pennsylvania State University, University Park, PA, USA;Syracuse University, Syracuse, NY, USA;The Pennsylvania State University, University Park, PA, USA;The Pennsylvania State University, University Park, PA, USA;The Pennsylvania State University, University Park, PA, USA;The Pennsylvania State University, University Park, PA, USA
Venue:
PLDI '10 Proceedings of the 2010 ACM SIGPLAN conference on Programming language design and implementation
Year:
2010

Abstract

The main contribution of this paper is a compiler-based, cache topology aware code optimization scheme for emerging multicore systems. The scheme distributes the iterations of a loop that is to be executed in parallel across the cores of a target multicore machine and schedules the iterations assigned to each core. Our goal is to improve the utilization of the on-chip multi-layer cache hierarchy and thereby maximize overall application performance. We evaluate our cache topology aware approach using a set of twelve applications and three different commercial Intel multicore machines. In addition, to study some of our experimental parameters in detail and to explore future multicore machines (with higher core counts and deeper on-chip cache hierarchies), we conduct a simulation-based study. The results collected on the three Intel machines show that the proposed compiler-based approach is very effective in enhancing performance. Our simulation results further indicate that optimizing for the on-chip cache hierarchy will become even more important in future multicores as the numbers of cores and cache levels increase.
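
To make the idea of distributing and scheduling loop iterations with the cache topology in mind more concrete, here is a minimal sketch in Python. It is not the algorithm from the paper, which the abstract does not spell out. It assumes a hypothetical greedy heuristic: iterations that touch the same data block are kept on cores that share a cache, and blocks are handed to the least-loaded cache group. The names map_iterations, block_of, and cache_groups are all illustrative assumptions.

```python
from collections import defaultdict


def map_iterations(num_iters, block_of, cache_groups):
    """Assign loop iterations to cores (hypothetical greedy heuristic).

    num_iters    -- number of loop iterations, numbered 0..num_iters-1
    block_of     -- function mapping an iteration to the data block it touches
    cache_groups -- list of core-id lists; cores in one list share a cache
    Returns a dict: core id -> ordered list of iterations to run on that core.
    """
    # Group iterations by the data block they touch.
    by_block = defaultdict(list)
    for i in range(num_iters):
        by_block[block_of(i)].append(i)

    schedule = {core: [] for group in cache_groups for core in group}
    load = [0] * len(cache_groups)  # iterations assigned to each cache group

    # Place the largest blocks first, each on the least-loaded cache group.
    ordered = sorted(by_block.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    for _block, iters in ordered:
        g = min(range(len(cache_groups)), key=lambda k: load[k])
        load[g] += len(iters)
        cores = cache_groups[g]
        # Spread the block's iterations round-robin over the group's cores,
        # so each core runs its share of one block before the next block.
        for n, i in enumerate(iters):
            schedule[cores[n % len(cores)]].append(i)
    return schedule


if __name__ == "__main__":
    # Assumed topology: four cores, with cores {0, 1} and {2, 3} each
    # sharing one cache; iteration i touches data block i // 4.
    plan = map_iterations(16, lambda i: i // 4, [[0, 1], [2, 3]])
    for core, iters in sorted(plan.items()):
        print(core, iters)
```

In this sketch every data block is touched only by cores behind a single shared cache, which is the kind of property a topology-aware mapping aims for. A real compiler pass would derive the iteration-to-data relation from the loop's array references rather than from a user-supplied function.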