The cache performance and optimizations of blocked algorithms
ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
A data locality optimizing algorithm
PLDI '91 Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation
Proceedings of the 1993 ACM/IEEE conference on Supercomputing
On Estimating and Enhancing Cache Effectiveness
Proceedings of the Fourth International Workshop on Languages and Compilers for Parallel Computing
Proceedings of the 6th International Workshop on Languages and Compilers for Parallel Computing
Compiler transformations for high-performance computing
ACM Computing Surveys (CSUR)
An effective programmable prefetch engine for on-chip caches
Proceedings of the 28th annual international symposium on Microarchitecture
Examination of a memory access classification scheme for pointer-intensive and numeric programs
ICS '96 Proceedings of the 10th international conference on Supercomputing
Fusion of Loops for Parallelism and Locality
IEEE Transactions on Parallel and Distributed Systems
Cache miss equations: an analytical representation of cache misses
ICS '97 Proceedings of the 11th international conference on Supercomputing
Automatic selection of high-order transformations in the IBM XL FORTRAN compilers
IBM Journal of Research and Development - Special issue: performance analysis and its impact on design
Precise miss analysis for program transformations with caches of arbitrary associativity
Proceedings of the eighth international conference on Architectural support for programming languages and operating systems
Cache miss equations: a compiler framework for analyzing and tuning memory behavior
ACM Transactions on Programming Languages and Systems (TOPLAS)
Locality optimizations for multi-level caches
SC '99 Proceedings of the 1999 ACM/IEEE conference on Supercomputing
Automated cache optimizations using CME driven diagnosis
Proceedings of the 14th international conference on Supercomputing
Tiling optimizations for 3D scientific computations
Proceedings of the 2000 ACM/IEEE conference on Supercomputing
Reducing Cache Conflicts by Multi-Level Cache Partitioning and Array Elements Mapping
The Journal of Supercomputing
False Sharing Elimination by Selection of Runtime Scheduling Parameters
ICPP '97 Proceedings of the international Conference on Parallel Processing
A Compiler Framework for Tiling Imperfectly-Nested Loops
LCPC '99 Proceedings of the 12th International Workshop on Languages and Compilers for Parallel Computing
A Quantitative Analysis of Tile Size Selection Algorithms
The Journal of Supercomputing
Efficient and Accurate Analytical Modeling of Whole-Program Data Cache Behavior
IEEE Transactions on Computers
Automatic tiling of iterative stencil loops
ACM Transactions on Programming Languages and Systems (TOPLAS)
Combining Models and Guided Empirical Search to Optimize for Multiple Levels of the Memory Hierarchy
Proceedings of the international symposium on Code generation and optimization
Eliminating Conflict Misses Using Prime Number-Based Cache Indexing
IEEE Transactions on Computers
Practical Structure Layout Optimization and Advice
Proceedings of the International Symposium on Code Generation and Optimization
Whole-program optimization of global variable layout
Proceedings of the 15th international conference on Parallel architectures and compilation techniques
Simultaneous minimization of capacity and conflict misses
Journal of Computer Science and Technology
B2P2: bounds based procedure placement for instruction TLB power reduction in embedded systems
Proceedings of the 13th International Workshop on Software & Compilers for Embedded Systems
Analysis of the spatial and temporal locality in data accesses
ICCS'06 Proceedings of the 6th international conference on Computational Science - Volume Part II
Hi-index | 0.01 |
It has been observed that memory access performance can be improved by restructuring data declarations, using simple transformations such as array dimension padding and inter-array padding (array alignment) to reduce the number of misses in the cache and TLB (translation lookaside buffer). These transformations can be applied to both static and dynamic array variables. In this paper, we provide a padding algorithm for selecting appropriate padding amounts, which takes into account various cache and TLB effects collectively within a single framework. In addition to reducing the number of misses, we identify the importance of reducing the impact of cache miss jamming by spreading cache misses more uniformly across loop iterations.We translate undesirable cache and TLB behaviors into a set of constraints on padding amounts and propose a heuristic algorithm of polynomial time complexity to find the padding amounts to satisfy these constraints. The goal of the padding algorithm is to select padding amounts so that there are no set conflicts and no offset conflicts in the cache and TLB, for a given loop. In practice, this algorithm can efficiently find small padding amounts to satisfy these constraints.