Matrix multiplication is a basic computing operation, but it is also expensive: the straightforward technique has O(N^3) runtime complexity. More sophisticated solutions such as Strassen's algorithm reduce this complexity to O(N^(log_2 7)), but the recursive nature of such algorithms places a large burden on memory systems because of temporary storage and the lack of locality in their access patterns.

In this paper we propose a scheme for reordering the matrix entries stored in memory. This reordering provides two major benefits: a simple method to transform the recursive algorithm into an iterative one, and a simple method for maintaining memory locality over the entire operation. Both features yield a performance improvement that grows as the problem size increases.

The proposed reordering scheme has been implemented in C. Our implementation eliminates the unnecessary storage of matrix elements from previous iterations. Tested on matrices of size up to 2048 × 2048, it exhibits improvements of 27.05% and 8.9% over the original algorithm and another reordering scheme, respectively.
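The abstract does not say which reordering the authors use. As an illustration only, the sketch below uses the well-known Morton (Z-order) layout, which may differ from the proposed scheme. It shows why a recursive layout helps: each quadrant of the matrix becomes one contiguous block of memory, so the four sub-problems of a divide-and-conquer multiplication can be addressed with plain offsets instead of strided submatrix views. The names spread_bits, morton_index, to_morton and quadrant are hypothetical helpers introduced for this example, and the matrix size is assumed to be a power of two.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Spread the low 16 bits of x so that they occupy the even bit positions. */
static uint32_t spread_bits(uint32_t x)
{
    x &= 0x0000FFFFu;
    x = (x | (x << 8)) & 0x00FF00FFu;
    x = (x | (x << 4)) & 0x0F0F0F0Fu;
    x = (x | (x << 2)) & 0x33333333u;
    x = (x | (x << 1)) & 0x55555555u;
    return x;
}

/* Z-order offset of element (row, col): row bits land on odd positions,
 * column bits on even positions. */
static uint32_t morton_index(uint32_t row, uint32_t col)
{
    return (spread_bits(row) << 1) | spread_bits(col);
}

/* Copy an n x n row-major matrix into Morton order.
 * Assumption: n is a power of two and at most 65536. */
static void to_morton(const double *src, double *dst, uint32_t n)
{
    assert(n != 0 && (n & (n - 1)) == 0);
    for (uint32_t i = 0; i < n; ++i)
        for (uint32_t j = 0; j < n; ++j)
            dst[morton_index(i, j)] = src[(size_t)i * n + j];
}

/* Quadrant q of an n x n Morton-ordered matrix
 * (0 = top-left, 1 = top-right, 2 = bottom-left, 3 = bottom-right)
 * is a contiguous run of (n/2)^2 elements starting at q * (n/2)^2. */
static const double *quadrant(const double *m, uint32_t n, unsigned q)
{
    size_t quarter = (size_t)(n / 2) * (n / 2);
    return m + q * quarter;
}
```

Because every quadrant is itself stored in Morton order, the same offset rule applies at each level. A recursion over quadrants can therefore be replaced by a loop over block offsets, which is the kind of recursive-to-iterative transformation the abstract describes.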