Matrix multiplication: a case study of enhanced data cache utilization

Authors:
N. Eiron; M. Rodeh; I. Steinwarts
Affiliations:
-; IBM Haifa; IBM Haifa
Venue:
Journal of Experimental Algorithmics (JEA)
Year:
1999

Abstract

Modern machines present two challenges to algorithm engineers and compiler writers: they have a superscalar, super-pipelined structure, and they have elaborate memory subsystems specifically designed to reduce latency and increase bandwidth. Matrix multiplication is a classical benchmark for experimenting with techniques used to exploit machine architecture and to overcome the limitations of contemporary memory subsystems.

This research aims at advancing the state of the art of algorithm engineering by balancing instruction-level parallelism, two levels of data tiling, copying to provably avoid any cache conflicts, and prefetching in parallel with computational operations, in order to fully exploit the memory bandwidth. Measurements on IBM's RS/6000 43P workstation show that the resultant matrix multiplication algorithm outperforms IBM's ESSL by 6.8-31.8%, is less sensitive to the size of the input data, and scales better.

In this paper we introduce a cache-aware algorithm for matrix multiplication. We also suggest generic guidelines that may be applied to compute-intensive algorithms to utilize the data cache efficiently. We believe that some of our concepts may be embodied in compilers.
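
To make the tiling and copying ideas concrete, here is a minimal C sketch of a matrix multiplication that uses a single level of tiling and copies each block of B into a contiguous buffer. It is an illustration only, not the authors' algorithm: the tile size `TILE`, the function names, and the test pattern are all assumptions, and the second (register-level) tiling, the prefetching, and the instruction scheduling described in the abstract are left out.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical tile edge; the paper's tuned sizes are not given in the abstract. */
#define TILE 32

/* Reference triple loop: C = A * B for n x n row-major matrices. */
static void matmul_naive(int n, const double *a, const double *b, double *c)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
}

static int min_int(int x, int y) { return x < y ? x : y; }

/* Tiled C = A * B. Each block of B (at most TILE x TILE) is copied into a
 * small contiguous buffer before use, so the inner loop reads a compact
 * region whose layout does not depend on the matrix dimension n. */
static void matmul_tiled_copy(int n, const double *a, const double *b, double *c)
{
    double bt[TILE * TILE];

    for (int i = 0; i < n * n; i++)
        c[i] = 0.0;

    for (int kk = 0; kk < n; kk += TILE) {
        int kmax = min_int(kk + TILE, n);
        for (int jj = 0; jj < n; jj += TILE) {
            int jmax = min_int(jj + TILE, n);

            /* Copy the B block, transposed, so the k loop below is unit-stride. */
            for (int k = kk; k < kmax; k++)
                for (int j = jj; j < jmax; j++)
                    bt[(j - jj) * TILE + (k - kk)] = b[k * n + j];

            /* Accumulate this block's contribution to columns jj..jmax-1 of C. */
            for (int i = 0; i < n; i++)
                for (int j = jj; j < jmax; j++) {
                    const double *bcol = &bt[(j - jj) * TILE];
                    double sum = c[i * n + j];
                    for (int k = kk; k < kmax; k++)
                        sum += a[i * n + k] * bcol[k - kk];
                    c[i * n + j] = sum;
                }
        }
    }
}

/* Illustrative check: fills two n x n matrices with a fixed pattern,
 * multiplies them both ways, and returns the largest absolute difference
 * between the results (or -1.0 if allocation fails). */
double max_abs_diff_vs_naive(int n)
{
    assert(n > 0);
    size_t count = (size_t)n * (size_t)n;
    double *a = malloc(count * sizeof *a);
    double *b = malloc(count * sizeof *b);
    double *c1 = malloc(count * sizeof *c1);
    double *c2 = malloc(count * sizeof *c2);
    double worst = -1.0;

    if (a && b && c1 && c2) {
        for (size_t i = 0; i < count; i++) {
            a[i] = (double)(i % 7) - 3.0;
            b[i] = (double)(i % 5) * 0.5;
        }
        matmul_naive(n, a, b, c1);
        matmul_tiled_copy(n, a, b, c2);
        worst = 0.0;
        for (size_t i = 0; i < count; i++) {
            double d = c1[i] - c2[i];
            if (d < 0.0)
                d = -d;
            if (d > worst)
                worst = d;
        }
    }
    free(a);
    free(b);
    free(c1);
    free(c2);
    return worst;
}
```

A caller could compare the two versions with, for example, `max_abs_diff_vs_naive(37)`; a size that is not a multiple of `TILE` exercises the partial tiles at the matrix edges. Whether the copy pays off in practice depends on the cache geometry and on `TILE`, which would need tuning per machine.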