A data locality optimizing algorithm
PLDI '91 Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation
Compiling for numa parallel machines
Compiling for numa parallel machines
Tile size selection using cache organization and data layout
PLDI '95 Proceedings of the ACM SIGPLAN 1995 conference on Programming language design and implementation
A compiler technique for improving whole-program locality
POPL '01 Proceedings of the 28th ACM SIGPLAN-SIGACT symposium on Principles of programming languages
Calculating stack distances efficiently
Proceedings of the 2002 workshop on Memory system performance
The SUIF Compiler System: a Parallelizing and Optimizing Research Compiler
The SUIF Compiler System: a Parallelizing and Optimizing Research Compiler
Dynamic Partitioning of Shared Cache Memory
The Journal of Supercomputing
Multi-objective mapping for mesh-based NoC architectures
Proceedings of the 2nd IEEE/ACM/IFIP international conference on Hardware/software codesign and system synthesis
Fair Cache Sharing and Partitioning in a Chip Multiprocessor Architecture
Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques
Predicting Inter-Thread Cache Contention on a Chip Multi-Processor Architecture
HPCA '05 Proceedings of the 11th International Symposium on High-Performance Computer Architecture
Optimizing Replication, Communication, and Capacity Allocation in CMPs
Proceedings of the 32nd annual international symposium on Computer Architecture
Multifacet's general execution-driven multiprocessor simulator (GEMS) toolset
ACM SIGARCH Computer Architecture News - Special issue: dasCMP'05
A hierarchical model of data locality
Conference record of the 33rd ACM SIGPLAN-SIGACT symposium on Principles of programming languages
Proceedings of the conference on Design, automation and test in Europe: Proceedings
Architectural support for operating system-driven CMP cache management
Proceedings of the 15th international conference on Parallel architectures and compilation techniques
A flexible data to L2 cache mapping approach for future multicore processors
Proceedings of the 2006 workshop on Memory system performance and correctness
Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Scheduling threads for constructive cache sharing on CMPs
Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures
Operating system scheduling for chip multithreaded processors
Operating system scheduling for chip multithreaded processors
Cooperative cache partitioning for chip multiprocessors
Proceedings of the 21st annual international conference on Supercomputing
Data Access Partitioning for Fine-grain Parallelism on Multicore Architectures
Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture
Adaptive set pinning: managing shared caches in chip multiprocessors
Proceedings of the 13th international conference on Architectural support for programming languages and operating systems
Application mapping for chip multiprocessors
Proceedings of the 45th annual Design Automation Conference
User-aware dynamic task allocation in networks-on-chip
Proceedings of the conference on Design, automation and test in Europe
ICCS '08 Proceedings of the 8th international conference on Computational Science, Part I
Adaptive insertion policies for managing shared caches
Proceedings of the 17th international conference on Parallel architectures and compilation techniques
Balancing Locality and Parallelism on Shared-cache Mulit-core Systems
HPCC '09 Proceedings of the 2009 11th IEEE International Conference on High Performance Computing and Communications
Data Layout Transformation for Enhancing Data Locality on NUCA Chip Multiprocessors
PACT '09 Proceedings of the 2009 18th International Conference on Parallel Architectures and Compilation Techniques
Parallel Data Reuse Theory for OpenMP Applications
SNPD '09 Proceedings of the 2009 10th ACIS International Conference on Software Engineering, Artificial Intelligences, Networking and Parallel/Distributed Computing
Evaluation techniques for storage hierarchies
IBM Systems Journal
IBM Journal of Research and Development
Optimizing shared cache behavior of chip multiprocessors
Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
Does cache sharing on modern CMP matter to the performance of contemporary multithreaded programs?
Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
Intra-application shared cache partitioning for multithreaded applications
Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
Compiler techniques for reducing data cache miss rate on a multithreaded architecture
HiPEAC'08 Proceedings of the 3rd international conference on High performance embedded architectures and compilers
Cache topology aware computation mapping for multicores
PLDI '10 Proceedings of the 2010 ACM SIGPLAN conference on Programming language design and implementation
Accelerating multicore reuse distance analysis with sampling and parallelization
Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Is reuse distance applicable to data locality analysis on chip multiprocessors?
CC'10/ETAPS'10 Proceedings of the 19th joint European conference on Theory and Practice of Software, international conference on Compiler Construction
Euro-Par'12 Proceedings of the 18th international conference on Parallel processing workshops
Hi-index | 0.00 |
Most of existing research on emerging multicore machines focus on parallelism extraction and architectural level optimizations. While these optimizations are critical, complementary approaches such as data locality enhancement can also bring significant benefits. Most of the previous data locality optimization techniques have been proposed and evaluated in the context of single core architectures. While one can expect these optimizations to be useful for multicore machines as well, multicores present further opportunities due to shared on-chip caches most of them accommodate. In order to optimize data locality targeting multicore machines however, the first step is to understand data reuse characteristics of multithreaded applications and potential benefits shared caches can bring. Motivated by these observations, we make the following contributions in this paper. First, we give a definition for inter-core data reuse and quantify it on multicores using a set of ten multithreaded application programs. Second, we show that neither on-chip cache hierarchies of current multicore architectures nor state-of-the-art (single-core centric) code/data optimizations exploit available inter-core data reuse in multithreaded applications. Third, we demonstrate that exploiting all available intercore reuse could boost overall application performance by around 21.3% on average, indicating that there is significant scope for optimization. However, we also show that trying to optimize for inter-core reuse aggressively without considering the impact of doing so on intra-core reuse can actually perform worse than optimizing for intra-core reuse alone. Finally, we present a novel, compiler-based data locality optimization strategy for multicores that balances both inter-core and intra-core reuse optimizations carefully to maximize benefits that can be extracted from shared caches. Our experiments with this strategy reveal that it is very effective in optimizing data locality in multicores.