Several studies have examined blocked data layouts in conjunction with loop tiling to improve locality of reference. In this paper, we further reduce cache misses by restructuring the memory layout of multi-dimensional arrays so that array elements are stored in blocks, in exactly the order in which the tiled instruction stream sweeps them. We present a straightforward way to translate multi-dimensional array indices into the blocked memory layout using fast, simple binary-mask operations. Experimental results and simulations show that performance improves substantially because of the considerable reduction in cache misses.
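The abstract does not spell out the indexing scheme, so the following is only a minimal sketch of how mask-based index translation might look, not the paper's actual method. It assumes a square two-dimensional array whose side `n` and block side `b` are both powers of two, with blocks stored in row-major order and elements stored row-major inside each block. The function name `blocked_offset` and its parameters are hypothetical, chosen for illustration.

```python
def blocked_offset(i: int, j: int, n: int, b: int) -> int:
    """Map element (i, j) of an n x n array to its offset in a blocked layout.

    Assumptions (not taken from the paper): n and b are powers of two,
    b divides n, blocks are laid out row-major, and elements are laid out
    row-major within each block.
    """
    log_b = b.bit_length() - 1                # log2 of the block side
    log_blocks = (n // b).bit_length() - 1    # log2 of blocks per array row
    mask = b - 1                              # selects the in-block bits

    block_row, block_col = i >> log_b, j >> log_b   # which block
    in_row, in_col = i & mask, j & mask             # position inside it

    block_id = (block_row << log_blocks) | block_col
    # Each block holds b*b elements, so the block id is shifted by 2*log_b.
    return (block_id << (2 * log_b)) | (in_row << log_b) | in_col


if __name__ == "__main__":
    n, b = 4, 2
    for i in range(n):
        print([blocked_offset(i, j, n, b) for j in range(n)])
```

Under these assumptions every element of a block occupies a contiguous range of `b * b` offsets, so a tiled loop that sweeps one block at a time touches consecutive memory. Only shifts, ANDs and ORs are needed because all sizes are powers of two; other sizes would need padding or ordinary multiplication and division.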