Optimizing shared cache behavior of chip multiprocessors

Authors:
Mahmut Kandemir;Sai Prashanth Muralidhara;Sri Hari Krishna Narayanan;Yuanrui Zhang;Ozcan Ozturk
Affiliations:
Pennsylvania State University;Pennsylvania State University;Pennsylvania State University;Pennsylvania State University;Bilkent University
Venue:
Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
Year:
2009

Abstract

One of the critical problems associated with emerging chip multiprocessors (CMPs) is the management of on-chip shared cache space. Unfortunately, single-processor-centric data locality optimization schemes may not work well in the CMP case, as data accesses from multiple cores can create conflicts in the shared cache space. The main contribution of this paper is a compiler-directed code restructuring scheme for enhancing locality of shared data in CMPs. The proposed scheme targets the last-level shared cache that exists in many commercial CMPs and has two components: allocation, which determines the set of loop iterations assigned to each core, and scheduling, which determines the order in which the iterations assigned to a core are executed. Our scheme restructures the application code such that the different cores operate on shared data blocks at the same time, to the extent allowed by data dependencies. This helps to reduce reuse distances for the shared data and improves on-chip cache performance. We evaluated our approach using the SPLASH-2 and PARSEC applications through both simulations and experiments on two commercial multi-core machines. Our experimental evaluation indicates that the proposed data locality optimization scheme reduces inter-core conflict misses in the shared cache by 67% on average when both allocation and scheduling are used. Also, the execution time improvements we achieve (29% on average) are very close to the optimal savings that could be achieved using a hypothetical scheme.
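
The effect of the scheduling component on reuse distance can be illustrated with a toy model. The sketch below is not the paper's algorithm. It assumes independent iterations (so any order is legal), one data block per iteration, a placeholder contiguous-chunk allocation, and lock-step execution of the cores. All function names and the access pattern in `block_of` are hypothetical, chosen only for illustration.

```python
def allocate_chunks(iterations, num_cores):
    """Allocation (placeholder policy): give each core a contiguous chunk."""
    chunk = (len(iterations) + num_cores - 1) // num_cores
    return [iterations[c * chunk:(c + 1) * chunk] for c in range(num_cores)]


def schedule_by_block(core_iterations, block_of):
    """Scheduling: order one core's iterations by the shared block they touch.

    Assumes the iterations are independent, so reordering is always legal.
    """
    return sorted(core_iterations, key=block_of)


def interleave(per_core):
    """Model the shared-cache access stream as lock-step execution of cores."""
    stream = []
    for step in range(max(len(c) for c in per_core)):
        for core_iterations in per_core:
            if step < len(core_iterations):
                stream.append(core_iterations[step])
    return stream


def mean_reuse_distance(stream, block_of):
    """Mean number of distinct other blocks touched between two consecutive
    accesses to the same block."""
    last_pos = {}
    distances = []
    for pos, iteration in enumerate(stream):
        block = block_of(iteration)
        if block in last_pos:
            between = {block_of(i) for i in stream[last_pos[block] + 1:pos]}
            distances.append(len(between))
        last_pos[block] = pos
    return sum(distances) / len(distances) if distances else 0.0


def block_of(i):
    """Hypothetical access pattern: iterations 0-7 walk blocks 0..7 forward,
    iterations 8-15 walk the same blocks backward."""
    return i if i < 8 else 15 - i


iterations = list(range(16))
baseline = allocate_chunks(iterations, num_cores=2)
restructured = [schedule_by_block(c, block_of) for c in baseline]

baseline_distance = mean_reuse_distance(interleave(baseline), block_of)
restructured_distance = mean_reuse_distance(interleave(restructured), block_of)
print(baseline_distance, restructured_distance)  # prints: 3.5 0.0
```

In this toy setting the two cores originally traverse the shared blocks in opposite directions, so most blocks are revisited only after several other blocks have been touched. Once each core's iterations are ordered by block, both cores touch the same block in the same step and the mean reuse distance drops to zero. Real loop nests carry dependencies and touch several blocks per iteration, which this sketch deliberately ignores.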