On improving the memory access patterns during the execution of Strassen's matrix multiplication algorithm

Authors:
Hossam ElGindy; George Ferizis
Affiliations:
The University of New South Wales, Sydney, NSW, Australia; The University of New South Wales, Sydney, NSW, Australia
Venue:
ACSC '04 Proceedings of the 27th Australasian conference on Computer science - Volume 26
Year:
2004

Abstract

Matrix multiplication is a basic computing operation. Although basic, it is also expensive: the straightforward technique has $O(N^3)$ runtime complexity. More complex solutions such as Strassen's algorithm reduce this complexity to $O(N^{\log_2 7})$, but the recursive nature of such algorithms places a large burden on memory systems because of temporary storage and the lack of locality in their access patterns.

In this paper we propose a scheme for reordering the matrix entries stored in memory. This reordering provides two major benefits: a simple method for transforming the recursive algorithm into an iterative one, and a simple method for maintaining memory locality over the entire operation. Both features provide a performance improvement that grows with the problem size.

The proposed reordering scheme has been implemented in C. Tested on matrices of size up to 2048 × 2048, our C implementation, which eliminates unnecessary storage of matrix elements from previous iterations, shows improvements of 27.05% and 8.9% over the original algorithm and another reordering scheme, respectively.
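
The abstract does not spell out the reordering itself, so the following is only an illustrative sketch of one well-known recursive layout, Morton (Z-order) indexing, and should not be read as the authors' scheme. It shows why a reordering of this kind helps a Strassen-style algorithm: after the copy, each quadrant of a power-of-two matrix occupies one contiguous range of memory, so a sub-matrix can be addressed by an offset and needs no strided access. The names `morton_index`, `reorder_to_morton`, and `quadrant` are hypothetical and chosen for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed layout (not taken from the paper): Morton / Z-order.
   Bit k of the column goes to bit 2k of the index and bit k of
   the row goes to bit 2k+1. Handles indices below 2^16. */
static size_t morton_index(unsigned r, unsigned c)
{
    size_t z = 0;
    for (unsigned k = 0; k < 16; k++) {
        z |= (size_t)((c >> k) & 1u) << (2 * k);
        z |= (size_t)((r >> k) & 1u) << (2 * k + 1);
    }
    return z;
}

/* Copy an n x n row-major matrix into Morton order.
   n must be a power of two; dst must hold n * n elements. */
static void reorder_to_morton(const double *src, double *dst, unsigned n)
{
    assert(n != 0 && (n & (n - 1)) == 0);
    for (unsigned r = 0; r < n; r++)
        for (unsigned c = 0; c < n; c++)
            dst[morton_index(r, c)] = src[(size_t)r * n + c];
}

/* Start of quadrant q of an n x n Morton-ordered block:
   0 = top-left, 1 = top-right, 2 = bottom-left, 3 = bottom-right.
   Each quadrant is a contiguous run of (n/2) * (n/2) elements. */
static const double *quadrant(const double *block, unsigned n, unsigned q)
{
    return block + (size_t)q * (n / 2) * (n / 2);
}
```

Because every quadrant is itself stored in Morton order, the same offset arithmetic applies at each level of the recursion, which is one way a layout like this could let the recursive algorithm be driven by loops over offsets.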