Matrix-matrix multiplication is a core computation in many algorithms from scientific computing and numerical analysis, and many efficient realizations, including parallel ones, have been developed over the years. The current trend toward using clusters of PCs or SMPs for scientific computing suggests revisiting matrix-matrix multiplication and investigating the efficiency and scalability of different versions on clusters. In this paper we present parallel algorithms for matrix-matrix multiplication that are composed of several algorithms in a multilevel structure. Each level is associated with a hierarchical partition of the set of available processors into disjoint subsets, so that deeper levels of the algorithm employ smaller groups of processors in parallel. We perform runtime experiments on several parallel platforms and show that multilevel algorithms can lead to significant performance gains compared with state-of-the-art methods.
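The multilevel idea can be illustrated with a small sequential simulation. This is only a sketch under stated assumptions, not the algorithms from the paper: the abstract does not specify which algorithms are combined or how the data is distributed, so the row-block decomposition, the `fanout` parameter, and the names `split`, `local_matmul`, and `multilevel_matmul` are all hypothetical. The sketch shows only how each level divides the processor set into disjoint subgroups, each of which works on a smaller subproblem.

```python
def split(items, parts):
    """Split a list into `parts` contiguous, disjoint, near-equal chunks."""
    size, rem = divmod(len(items), parts)
    out, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < rem else 0)
        out.append(items[start:end])
        start = end
    return out


def local_matmul(A, B):
    """Plain triple-loop product, standing in for a tuned single-node kernel."""
    k, m = len(B), len(B[0])
    return [[sum(row[p] * B[p][j] for p in range(k)) for j in range(m)]
            for row in A]


def multilevel_matmul(A, B, procs, fanout=2, level=0, trace=None):
    """Compute A * B, assigning row blocks of A to nested processor groups.

    Assumption: every level uses the same row-block scheme. A real
    multilevel algorithm could use a different algorithm per level.
    The groups are simulated sequentially; no actual parallelism occurs.
    """
    if trace is not None:
        trace.append((level, tuple(procs)))  # record which group works here
    # Leaf case: the group or the row block is too small to split further.
    if len(procs) < fanout or len(A) < fanout:
        return local_matmul(A, B)
    C = []
    # Each disjoint subgroup computes one row block of the result.
    for rows, group in zip(split(A, fanout), split(procs, fanout)):
        C.extend(multilevel_matmul(rows, B, group, fanout, level + 1, trace))
    return C


if __name__ == "__main__":
    A = [[1, 2], [3, 4], [5, 6], [7, 8]]
    B = [[1, 2], [3, 4]]
    trace = []
    C = multilevel_matmul(A, B, procs=list(range(4)), trace=trace)
    print(C)
    for level, group in trace:
        print("  " * level + f"level {level}: processors {group}")
```

With four simulated processors and `fanout=2`, the trace shows the full group at level 0, two groups of two at level 1, and four single-processor groups at level 2, mirroring the statement that deeper levels employ smaller groups. In a message-passing setting the subgroups would typically be realized as separate communicators that run concurrently.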