Reshaping cache misses to improve row-buffer locality in multicore systems

Authors:
Wei Ding;Jun Liu;Mahmut Kandemir;Mary Jane Irwin
Affiliations:
The Pennsylvania State University, University Park, PA, USA (all four authors)
Venue:
PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques
Year:
2013

Abstract

Optimizing cache locality has been important ever since caches emerged, and numerous cache locality optimization schemes have been published in the compiler literature. However, in modern architectures, cache locality is not the only factor that determines memory system performance. Many emerging multicores employ banked memory systems in which each bank has an attached row-buffer that holds the most recently accessed memory row (page). A last-level cache miss that also misses in the row-buffer can experience much higher latency than a cache miss that hits in the row-buffer. Consequently, optimizing for row-buffer locality can be as important as optimizing for cache locality. Targeting emerging multicores and multithreaded applications, this paper presents a compiler-directed row-buffer locality optimization strategy. This strategy modifies the memory layout of data to increase the number of row-buffer hits without increasing the number of misses in the on-chip cache hierarchy. We implemented the proposed strategy in an open-source compiler and tested its effectiveness in improving row-buffer performance using a set of multithreaded applications. Our results indicate that the proposed approach reduces the average data access latency by about 29%, which translates, on average, to about a 15% improvement in execution time.
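To make the notion of a row-buffer hit concrete, the following is a minimal sketch of a toy banked-memory model. It is not the paper's algorithm or its compiler pass. The row size, bank count, element size, row-to-bank interleaving, and the names `row_buffer_hit_rate` and `column_walk` are all assumptions chosen for illustration, and no cache is modeled: every address is treated as a request that reaches memory. The sketch only shows how the same column-wise traversal can produce very different row-buffer hit rates under two data layouts.

```python
# Toy model of row-buffer locality. All parameters are illustrative
# assumptions, not values taken from the paper.
ROW_BYTES = 4096   # assumed size of one memory row (page)
NUM_BANKS = 8      # assumed number of banks
ELEM_BYTES = 8     # assumed element size (e.g. a double)


def row_buffer_hit_rate(addresses):
    """Return the fraction of requests that hit the open row of their bank."""
    open_row = {}  # bank -> row currently held in that bank's row-buffer
    hits = 0
    total = 0
    for addr in addresses:
        row_id = addr // ROW_BYTES   # global row index
        bank = row_id % NUM_BANKS    # assumed: rows interleaved across banks
        row = row_id // NUM_BANKS    # row index within the bank
        if open_row.get(bank) == row:
            hits += 1
        open_row[bank] = row         # the accessed row is now the open row
        total += 1
    return hits / total if total else 0.0


def column_walk(n, layout):
    """Yield byte addresses for a column-by-column walk of an n x n array."""
    for j in range(n):
        for i in range(n):
            if layout == "row_major":
                index = i * n + j    # consecutive i are n elements apart
            else:
                index = j * n + i    # consecutive i are adjacent in memory
            yield index * ELEM_BYTES


N = 512
row_major_rate = row_buffer_hit_rate(column_walk(N, "row_major"))
col_major_rate = row_buffer_hit_rate(column_walk(N, "col_major"))
print(f"row-major layout:    {row_major_rate:.4f}")
print(f"column-major layout: {col_major_rate:.4f}")
```

Under these assumed parameters, each array row in the row-major layout occupies exactly one memory row, so the column walk opens a different row on every request and never hits in a row-buffer. With the column-major layout the same walk is sequential in memory and misses only on the first request to each memory row. A real system would filter this stream through the cache hierarchy first, which this sketch deliberately leaves out.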