In this paper we report on the development of an efficient and portable implementation of Strassen's matrix multiplication algorithm for matrices of arbitrary size. Our criterion for stopping the recursion is more detailed than those generally used, which allows enhanced performance for a larger set of input sizes. In addition, we handle odd matrix dimensions using a method whose usefulness had previously been in question and had not yet been demonstrated. We have also reduced memory requirements, in certain cases by 40 to more than 70 percent compared with other similar implementations. We measure the performance of our code on the IBM RS/6000, CRAY YMP C90, and CRAY T3D single processor, and offer comparisons to other codes. Finally, we demonstrate the usefulness of our implementation by using it to perform the matrix multiplications in a large application code.
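To make the two design questions in the abstract concrete (when to stop recursing, and what to do with odd dimensions), here is a minimal sketch of Strassen's algorithm in plain Python. It is an illustration under stated assumptions, not the implementation the paper reports on. The names `strassen`, `naive_mul`, and `cutoff` are hypothetical. The stopping rule is a single size threshold, which is simpler than the detailed criterion the abstract describes. The abstract does not name its odd-dimension method, so the sketch assumes one common option, peeling: strip the odd row or column, run Strassen on the even-sized core, and patch the result with conventional arithmetic.

```python
def naive_mul(A, B):
    """Conventional O(m*k*n) product of an m x k and a k x n matrix."""
    k, n = len(B), len(B[0])
    return [[sum(row[p] * B[p][j] for p in range(k)) for j in range(n)]
            for row in A]

def add(X, Y):
    return [[x + y for x, y in zip(r, s)] for r, s in zip(X, Y)]

def sub(X, Y):
    return [[x - y for x, y in zip(r, s)] for r, s in zip(X, Y)]

def block(M, r0, r1, c0, c1):
    """Copy rows r0..r1-1 and columns c0..c1-1 of M."""
    return [row[c0:c1] for row in M[r0:r1]]

def strassen(A, B, cutoff=16):
    """Multiply A (m x k) by B (k x n); `cutoff` is an assumed simple threshold."""
    m, k, n = len(A), len(B), len(B[0])
    # Assumed stopping rule: fall back once any dimension is small.
    if min(m, k, n) <= max(cutoff, 1):
        return naive_mul(A, B)

    # Peel odd dimensions: me, ke, ne are the largest even sizes that fit.
    me, ke, ne = m - m % 2, k - k % 2, n - n % 2
    hm, hk, hn = me // 2, ke // 2, ne // 2
    A11, A12 = block(A, 0, hm, 0, hk), block(A, 0, hm, hk, ke)
    A21, A22 = block(A, hm, me, 0, hk), block(A, hm, me, hk, ke)
    B11, B12 = block(B, 0, hk, 0, hn), block(B, 0, hk, hn, ne)
    B21, B22 = block(B, hk, ke, 0, hn), block(B, hk, ke, hn, ne)

    # The seven recursive products of Strassen's original formulation.
    M1 = strassen(add(A11, A22), add(B11, B22), cutoff)
    M2 = strassen(add(A21, A22), B11, cutoff)
    M3 = strassen(A11, sub(B12, B22), cutoff)
    M4 = strassen(A22, sub(B21, B11), cutoff)
    M5 = strassen(add(A11, A12), B22, cutoff)
    M6 = strassen(sub(A21, A11), add(B11, B12), cutoff)
    M7 = strassen(sub(A12, A22), add(B21, B22), cutoff)

    C11 = add(sub(add(M1, M4), M5), M7)
    C12 = add(M3, M5)
    C21 = add(M2, M4)
    C22 = add(add(sub(M1, M2), M3), M6)
    C = [l + r for l, r in zip(C11, C12)] + [l + r for l, r in zip(C21, C22)]

    # Fix-up for the peeled parts, done with conventional arithmetic.
    if k != ke:  # odd inner dimension: rank-1 update of the even core
        for i in range(me):
            for j in range(ne):
                C[i][j] += A[i][k - 1] * B[k - 1][j]
    if n != ne:  # odd column count: append the last column of the product
        for i in range(me):
            C[i].append(sum(A[i][p] * B[p][n - 1] for p in range(k)))
    if m != me:  # odd row count: append the last row of the product
        C.append([sum(A[m - 1][p] * B[p][j] for p in range(k))
                  for j in range(n)])
    return C
```

Calling `strassen(A, B, cutoff=2)` on a 13 x 11 matrix `A` and an 11 x 9 matrix `B` exercises the peeling path at more than one recursion level. The sketch copies every sub-block, so it makes no attempt at the memory savings the abstract reports, and a practical cutoff would have to be tuned per machine.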