Fast indexing for blocked array layouts to reduce cache misses

Authors:
Evangelia Athanasaki;Nectarios Koziris
Affiliations:
Computing Systems Laboratory, School of Electrical and Computer Engineering, National Technical University of Athens, Greece.;Computing Systems Laboratory, School of Electrical and Computer Engineering, National Technical University of Athens, Greece
Venue:
International Journal of High Performance Computing and Networking
Year:
2005

Citing 28
Cited 1

The cache performance and optimizations of blocked algorithms

ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
Tolerating latency through software-controlled prefetching in shared-memory multiprocessors

Journal of Parallel and Distributed Computing - Special issue on shared-memory multiprocessors
A data locality optimizing algorithm

PLDI '91 Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation
To copy or not to copy: a compile-time technique for assessing when data copying should be used to eliminate cache conflicts

Proceedings of the 1993 ACM/IEEE conference on Supercomputing
Unifying data and control transformations for distributed shared-memory machines

PLDI '95 Proceedings of the ACM SIGPLAN 1995 conference on Programming language design and implementation
Tile size selection using cache organization and data layout

PLDI '95 Proceedings of the ACM SIGPLAN 1995 conference on Programming language design and implementation
Hitting the memory wall: implications of the obvious

ACM SIGARCH Computer Architecture News
Improving data locality with loop transformations

ACM Transactions on Programming Languages and Systems (TOPLAS)
A compiler algorithm for optimizing locality in loop nests

ICS '97 Proceedings of the 11th international conference on Supercomputing
Data transformations for eliminating conflict misses

PLDI '98 Proceedings of the ACM SIGPLAN 1998 conference on Programming language design and implementation
Precise miss analysis for program transformations with caches of arbitrary associativity

Proceedings of the eighth international conference on Architectural support for programming languages and operating systems
Automatic Compiler-Inserted Prefetching for Pointer-Based Applications

IEEE Transactions on Computers - Special issue on cache memory and related problems
Augmenting Loop Tiling with Data Alignment for Improved Cache Performance

IEEE Transactions on Computers - Special issue on cache memory and related problems
Improving Cache Locality by a Combination of Loop and Data Transformations

IEEE Transactions on Computers - Special issue on cache memory and related problems
A Linear Algebra Framework for Automatic Determination of Optimal Data Layouts

IEEE Transactions on Parallel and Distributed Systems
Nonlinear array layouts for hierarchical memory systems

ICS '99 Proceedings of the 13th international conference on Supercomputing
A tile selection algorithm for data locality and cache interference

ICS '99 Proceedings of the 13th international conference on Supercomputing
Recursive array layouts and fast parallel matrix multiplication

Proceedings of the eleventh annual ACM symposium on Parallel algorithms and architectures
Locality optimizations for multi-level caches

SC '99 Proceedings of the 1999 ACM/IEEE conference on Supercomputing
Evaluating the impact of memory system performance on software prefetching and locality optimizations

ICS '01 Proceedings of the 15th international conference on Supercomputing
Language support for Morton-order matrices

PPoPP '01 Proceedings of the eighth ACM SIGPLAN symposium on Principles and practices of parallel programming
Efficient Representation Scheme for Multidimensional Array Operations

IEEE Transactions on Computers
Software caching vs. prefetching

Proceedings of the 3rd international symposium on Memory management
Computer architecture: a quantitative approach

Computer architecture: a quantitative approach
Cache Profiling and the SPEC Benchmarks: A Case Study

Computer
A Layout-Conscious Iteration Space Transformation Technique

IEEE Transactions on Computers
Analysis of Memory Hierarchy Performance of Block Data Layout

ICPP '02 Proceedings of the 2002 International Conference on Parallel Processing
Tiling, Block Data Layout, and Memory Hierarchy Performance

IEEE Transactions on Parallel and Distributed Systems

Runtime adaptation: a case for reactive code alignment

Proceedings of the 2nd International Workshop on Adaptive Self-Tuning Computing Systems for the Exaflop Era

Quantified Score

Hi-index	0.00

Visualization

Abstract

Several studies have been conducted on blocked data layouts, in conjunction with loop tiling to improve locality of references. In this paper, we further reduce cache misses, restructuring the memory layout of multi-dimensional arrays, so that array elements are stored in a blocked way, exactly as they are swept by the tiled instruction stream. A straightforward way is presented to easily translate multi-dimensional indexing of arrays into their blocked memory layout using quick and simple binary-mask operations. Actual experimental results and simulations illustrate that performance is greatly improved because of the considerable reduction of cache misses.