Optimizing cache locality has been important ever since caches emerged, and numerous cache locality optimization schemes have been published in the compiler literature. In modern architectures, however, cache locality is not the only factor that determines memory system performance. Many emerging multicores employ banked memory systems, in which each bank has an attached row buffer that holds the most recently accessed memory row (page). A last-level cache miss that also misses in the row buffer can experience much higher latency than one that hits in the row buffer. Consequently, optimizing for row-buffer locality can be as important as optimizing for cache locality. Targeting emerging multicores and multithreaded applications, this paper presents a compiler-directed row-buffer locality optimization strategy. The strategy modifies the memory layout of data to increase the number of row-buffer hits without increasing the number of misses in the on-chip cache hierarchy. We implemented the proposed strategy in an open-source compiler and evaluated its effect on row-buffer performance using a set of multithreaded applications. Our results indicate that the proposed approach improves average data access latency by about 29%, which translates, on average, to about a 15% improvement in execution time.
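To make the row-buffer effect concrete, here is a minimal toy model. It is not the paper's compiler algorithm. It only counts row-buffer hits for an address stream under an open-row policy, and compares a column-wise walk over a row-major array against the same walk after a transposing layout change. The function name `row_buffer_hits`, the address mapping (rows interleaved across banks), and all sizes (4 banks, 1 KiB rows, a 64 x 64 array of 8-byte elements) are assumptions chosen for illustration. Cache behavior is not modeled.

```python
def row_buffer_hits(addresses, num_banks=4, row_bytes=1024):
    """Count row-buffer hits for a byte-address stream (toy open-row model).

    Assumed mapping: consecutive DRAM rows are interleaved across banks,
    and each bank keeps only its most recently accessed row open.
    """
    open_row = {}  # bank id -> row currently held in that bank's row buffer
    hits = 0
    for addr in addresses:
        row_index = addr // row_bytes   # global DRAM row number
        bank = row_index % num_banks    # bank that owns this row
        row = row_index // num_banks    # row number within the bank
        if open_row.get(bank) == row:
            hits += 1                   # requested row is already open
        else:
            open_row[bank] = row        # miss: open the requested row
    return hits


N = 64      # hypothetical array dimension
ELEM = 8    # bytes per element

# Column-wise walk over a row-major array: stride of N * ELEM bytes.
original = [(i * N + j) * ELEM for j in range(N) for i in range(N)]

# Same walk after transposing the layout: accesses become sequential.
transformed = [(j * N + i) * ELEM for j in range(N) for i in range(N)]

hits_original = row_buffer_hits(original)
hits_transformed = row_buffer_hits(transformed)
print(hits_original, hits_transformed)
```

In this model the transformed layout yields far more row-buffer hits for the same loop order, which is the kind of gain a layout transformation aims for. A real scheme must also ensure, as the abstract states, that on-chip cache misses do not increase.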