Optimizing matrix multiply using PHiPAC: a portable, high-performance, ANSI C coding methodology
ICS '97: Proceedings of the 11th International Conference on Supercomputing
Auto-blocking matrix-multiplication or tracking BLAS3 performance from source code
PPoPP '97: Proceedings of the Sixth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
A fast Fourier transform compiler
PLDI '99: Proceedings of the ACM SIGPLAN 1999 Conference on Programming Language Design and Implementation
Implementation of Strassen's algorithm for matrix multiplication
Supercomputing '96: Proceedings of the 1996 ACM/IEEE Conference on Supercomputing
SPL: a language and compiler for DSP algorithms
PLDI '01: Proceedings of the ACM SIGPLAN 2001 Conference on Programming Language Design and Implementation
Optimizing compilers for modern architectures: a dependence-based approach
Morgan Kaufmann Publishers
Tuning Strassen's matrix multiplication for memory efficiency
SC '98: Proceedings of the 1998 ACM/IEEE Conference on Supercomputing
Recursive array layouts and fast matrix multiplication
IEEE Transactions on Parallel and Distributed Systems
Efficient procedures for using matrix algorithms
Proceedings of the 2nd Colloquium on Automata, Languages and Programming
Automatically Tuned Linear Algebra Software
HPCASIA '05: Proceedings of the Eighth International Conference on High-Performance Computing in Asia-Pacific Region
Optimizing matrix multiplication with a classifier learning system
LCPC '05: Proceedings of the 18th International Conference on Languages and Compilers for Parallel Computing
Matrix-matrix multiplication (MMM) is a highly important kernel in linear algebra, and the performance of its implementations depends on memory utilization and data locality. Several MMM algorithms exist, such as the standard algorithm and the Strassen–Winograd variant, as do recursive array layouts such as Z-Morton and U-Morton; however, their data locality is lower than that of the proposed methodology. Moreover, several state-of-the-art (SOA) self-tuning libraries exist, such as ATLAS for the MMM algorithm, which tests many candidate MMM implementations. Installing ATLAS requires both an extremely complex empirical tuning step and a large number of compiler options, neither of which is within the scope of this paper. This paper presents a new methodology based on the standard MMM algorithm that achieves improved performance by focusing on data locality (both temporal and spatial); it finds the schedule that conforms to optimum memory management. Compared with (Chatterjee et al. in IEEE Trans. Parallel Distrib. Syst. 13:1105, 2002; Li and Garzaran in Proc. of Lang. Compil. Parallel Comput., 2005; Bilmes et al. in Proc. of the 11th ACM Int. Conf. Supercomput., 1997; Aberdeen and Baxter in Concurr. Comput. Pract. Exp. 13:103, 2001), the proposed methodology has two major advantages. First, the schedule used at the tile level differs from the one used at the element level, yielding better data locality that is matched to the sizes of the memory hierarchy. Second, its exploration time is short, because it searches only for the number of tiling levels used and, within the range (1, 2) (Sect. 4), for the best tile size at each cache level. A software tool (C code) implementing this methodology was developed, taking the hardware model and the matrix sizes as input. The methodology outperforms others across a wide range of architectures. Compared with the best existing related work, which we implemented, performance gains of up to 55% over the standard MMM algorithm and up to 35% over Strassen's are observed, both under recursive data array layouts.
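For concreteness, the following is a minimal sketch in C of one level of the loop tiling the abstract describes. The matrix dimension N, the tile size T, and the particular loop orders are illustrative assumptions, not the paper's derived schedule: the methodology computes the number of tiling levels and the tile sizes from the hardware model (Sect. 4).

#include <stddef.h>

#define N 512   /* matrix dimension; assumed square and divisible by T */
#define T 64    /* tile size; illustrative, the paper derives it from the cache sizes */

static double A[N][N], B[N][N], C[N][N];

/* One level of tiling: C += A * B. The tile-level loops (ii, jj, kk) walk
 * tiles in an order that keeps the current C tile resident across all kk
 * iterations (temporal locality); the element-level loops use an i-k-j
 * order so that the innermost accesses to B and C are stride-1 (spatial
 * locality). */
void mmm_tiled(void)
{
    for (size_t ii = 0; ii < N; ii += T)
        for (size_t jj = 0; jj < N; jj += T)
            for (size_t kk = 0; kk < N; kk += T)
                for (size_t i = ii; i < ii + T; i++)
                    for (size_t k = kk; k < kk + T; k++) {
                        double a = A[i][k];          /* reused across the j loop */
                        for (size_t j = jj; j < jj + T; j++)
                            C[i][j] += a * B[k][j];
                    }
}

int main(void)
{
    /* Initialize with simple values and run the tiled kernel once. */
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++) {
            A[i][j] = 1.0;
            B[i][j] = 1.0;
            C[i][j] = 0.0;
        }
    mmm_tiled();
    return 0;
}

Even this sketch shows the point the abstract makes: the tile-level loop order (ii, jj, kk) can be chosen independently of the element-level order (i, k, j), so each level of the schedule can target a different level of the memory hierarchy.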