Augmenting Loop Tiling with Data Alignment for Improved Cache Performance

Authors:
Preeti Ranjan Panda;Hiroshi Nakamura;Nikil D. Dutt;Alexandru Nicolau
Affiliations:
Synopsys, Inc., Mountain View, CA;Univ. of Tokyo, Tokyo, Japan;Univ. of California at Irvine, Irvine;Univ. of California at Irvine, Irvine
Venue:
IEEE Transactions on Computers - Special issue on cache memory and related problems
Year:
1999

Abstract

Loop blocking (tiling) is a well-known compiler optimization that improves cache performance by dividing the loop iteration space into smaller blocks (tiles); reuse of array elements within each tile is maximized by ensuring that the working set of the tile fits in the data cache. Padding is a data alignment technique that inserts dummy elements into a data structure to improve cache performance. In this work, we present DAT, a technique that augments loop tiling with data alignment, achieving improved efficiency (by ensuring that the cache is never under-utilized) as well as improved flexibility (by eliminating self-interference cache conflicts independent of the tile size). This results in more stable and better cache performance than existing approaches, in addition to maximizing cache utilization, eliminating self-interference, and minimizing cross-interference conflicts. Further, while all previous efforts are targeted at programs characterized by the reuse of a single array, we also address the issue of minimizing conflict misses when several tiled arrays are involved. To validate our technique, we ran extensive experiments using both simulations and actual measurements on SUN Sparc5 and Sparc10 workstations. The results on benchmarks exhibiting varying memory access patterns demonstrate the effectiveness of our technique through consistently high hit ratios and improved performance across varying problem sizes.
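
The two ingredients named in the abstract, tiling and padding, can be illustrated with a minimal sketch. This is not the DAT algorithm itself, whose tile-size and pad-selection procedure is not given in the abstract. It is a generic tiled matrix multiplication over row-major arrays with a padded leading dimension, plus a deliberately simplified direct-mapped cache model (one element per cache line) for counting self-interference within a tile. All function names and parameters here (`pack`, `tiled_matmul`, `tile_self_conflicts`, `ld`, `cache_elems`) are hypothetical and chosen only for illustration.

```python
def pack(rows, ld):
    """Store a square matrix row-major with leading dimension ld >= n.
    The extra ld - n slots per row are padding (dummy elements)."""
    n = len(rows)
    flat = [0] * (n * ld)
    for i in range(n):
        for j in range(n):
            flat[i * ld + j] = rows[i][j]
    return flat


def tiled_matmul(a, b, n, ld, tile):
    """Compute C = A * B for n x n matrices stored with leading dimension ld,
    iterating over tile x tile blocks so each block is reused while hot."""
    c = [0] * (n * ld)
    for ii in range(0, n, tile):
        for kk in range(0, n, tile):
            for jj in range(0, n, tile):
                for i in range(ii, min(ii + tile, n)):
                    for k in range(kk, min(kk + tile, n)):
                        aik = a[i * ld + k]
                        for j in range(jj, min(jj + tile, n)):
                            c[i * ld + j] += aik * b[k * ld + j]
    return c


def tile_self_conflicts(ld, tile_rows, tile_cols, cache_elems):
    """Count elements of one tile that map to a cache slot already used by
    another element of the same tile. Assumed model: direct-mapped cache,
    cache_elems slots, one array element per slot."""
    seen = set()
    conflicts = 0
    for i in range(tile_rows):
        for j in range(tile_cols):
            slot = (i * ld + j) % cache_elems
            if slot in seen:
                conflicts += 1
            else:
                seen.add(slot)
    return conflicts
```

Under this simplified model, an 8 x 8 tile of an array with 256-element rows in a 1024-slot cache has 32 elements that collide with other elements of the same tile, because every fourth row maps back to the same slots. Padding each row to 264 elements removes those collisions without changing the tile size, which is the kind of effect the abstract attributes to combining tiling with data alignment.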