Studying inter-core data reuse in multicores

Authors:
Yuanrui Zhang;Mahmut Kandemir;Taylan Yemliha
Affiliations:
Pennsylvania State University, University Park, PA, USA;Pennsylvania State University, University Park, PA, USA;Syracuse University, Syracuse, NY, USA
Venue:
Proceedings of the ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems
Year:
2011

Citing 39
Cited 1

A data locality optimizing algorithm

PLDI '91 Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation
Compiling for numa parallel machines

Compiling for numa parallel machines
Tile size selection using cache organization and data layout

PLDI '95 Proceedings of the ACM SIGPLAN 1995 conference on Programming language design and implementation
A compiler technique for improving whole-program locality

POPL '01 Proceedings of the 28th ACM SIGPLAN-SIGACT symposium on Principles of programming languages
Calculating stack distances efficiently

Proceedings of the 2002 workshop on Memory system performance
The SUIF Compiler System: a Parallelizing and Optimizing Research Compiler

The SUIF Compiler System: a Parallelizing and Optimizing Research Compiler
Dynamic Partitioning of Shared Cache Memory

The Journal of Supercomputing
Multi-objective mapping for mesh-based NoC architectures

Proceedings of the 2nd IEEE/ACM/IFIP international conference on Hardware/software codesign and system synthesis
Fair Cache Sharing and Partitioning in a Chip Multiprocessor Architecture

Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques
Predicting Inter-Thread Cache Contention on a Chip Multi-Processor Architecture

HPCA '05 Proceedings of the 11th International Symposium on High-Performance Computer Architecture
Optimizing Replication, Communication, and Capacity Allocation in CMPs

Proceedings of the 32nd annual international symposium on Computer Architecture
Multifacet's general execution-driven multiprocessor simulator (GEMS) toolset

ACM SIGARCH Computer Architecture News - Special issue: dasCMP'05
A hierarchical model of data locality

Conference record of the 33rd ACM SIGPLAN-SIGACT symposium on Principles of programming languages
Communication-aware allocation and scheduling framework for stream-oriented multi-processor systems-on-chip

Proceedings of the conference on Design, automation and test in Europe: Proceedings
Architectural support for operating system-driven CMP cache management

Proceedings of the 15th international conference on Parallel architectures and compilation techniques
A flexible data to L2 cache mapping approach for future multicore processors

Proceedings of the 2006 workshop on Memory system performance and correctness
Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches

Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Scheduling threads for constructive cache sharing on CMPs

Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures
Operating system scheduling for chip multithreaded processors

Operating system scheduling for chip multithreaded processors
Cooperative cache partitioning for chip multiprocessors

Proceedings of the 21st annual international conference on Supercomputing
Data Access Partitioning for Fine-grain Parallelism on Multicore Architectures

Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture
Adaptive set pinning: managing shared caches in chip multiprocessors

Proceedings of the 13th international conference on Architectural support for programming languages and operating systems
Application mapping for chip multiprocessors

Proceedings of the 45th annual Design Automation Conference
User-aware dynamic task allocation in networks-on-chip

Proceedings of the conference on Design, automation and test in Europe
Second Generation Quad-Core Intel Xeon Processors Bring 45 nm Technology and a New Level of Performance to HPC Applications

ICCS '08 Proceedings of the 8th international conference on Computational Science, Part I
Using OS Observations to Improve Performance in Multicore Systems

IEEE Micro
Adaptive insertion policies for managing shared caches

Proceedings of the 17th international conference on Parallel architectures and compilation techniques
Balancing Locality and Parallelism on Shared-cache Mulit-core Systems

HPCC '09 Proceedings of the 2009 11th IEEE International Conference on High Performance Computing and Communications
Data Layout Transformation for Enhancing Data Locality on NUCA Chip Multiprocessors

PACT '09 Proceedings of the 2009 18th International Conference on Parallel Architectures and Compilation Techniques
Parallel Data Reuse Theory for OpenMP Applications

SNPD '09 Proceedings of the 2009 10th ACIS International Conference on Software Engineering, Artificial Intelligences, Networking and Parallel/Distributed Computing
Evaluation techniques for storage hierarchies

IBM Systems Journal
LRU stack processing

IBM Journal of Research and Development
Optimizing shared cache behavior of chip multiprocessors

Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
Does cache sharing on modern CMP matter to the performance of contemporary multithreaded programs?

Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
Intra-application shared cache partitioning for multithreaded applications

Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
Compiler techniques for reducing data cache miss rate on a multithreaded architecture

HiPEAC'08 Proceedings of the 3rd international conference on High performance embedded architectures and compilers
Cache topology aware computation mapping for multicores

PLDI '10 Proceedings of the 2010 ACM SIGPLAN conference on Programming language design and implementation
Accelerating multicore reuse distance analysis with sampling and parallelization

Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Is reuse distance applicable to data locality analysis on chip multiprocessors?

CC'10/ETAPS'10 Proceedings of the 19th joint European conference on Theory and Practice of Software, international conference on Compiler Construction

Performance and energy efficiency analysis of data reuse transformation methodology on multicore processor

Euro-Par'12 Proceedings of the 18th international conference on Parallel processing workshops

Quantified Score

Hi-index	0.00

Visualization

Abstract

Most of existing research on emerging multicore machines focus on parallelism extraction and architectural level optimizations. While these optimizations are critical, complementary approaches such as data locality enhancement can also bring significant benefits. Most of the previous data locality optimization techniques have been proposed and evaluated in the context of single core architectures. While one can expect these optimizations to be useful for multicore machines as well, multicores present further opportunities due to shared on-chip caches most of them accommodate. In order to optimize data locality targeting multicore machines however, the first step is to understand data reuse characteristics of multithreaded applications and potential benefits shared caches can bring. Motivated by these observations, we make the following contributions in this paper. First, we give a definition for inter-core data reuse and quantify it on multicores using a set of ten multithreaded application programs. Second, we show that neither on-chip cache hierarchies of current multicore architectures nor state-of-the-art (single-core centric) code/data optimizations exploit available inter-core data reuse in multithreaded applications. Third, we demonstrate that exploiting all available intercore reuse could boost overall application performance by around 21.3% on average, indicating that there is significant scope for optimization. However, we also show that trying to optimize for inter-core reuse aggressively without considering the impact of doing so on intra-core reuse can actually perform worse than optimizing for intra-core reuse alone. Finally, we present a novel, compiler-based data locality optimization strategy for multicores that balances both inter-core and intra-core reuse optimizations carefully to maximize benefits that can be extracted from shared caches. Our experiments with this strategy reveal that it is very effective in optimizing data locality in multicores.