Multicore systems are becoming ubiquitous in scientific computing. As performance libraries are adapted to such systems, extracting the best performance from them remains difficult. Indeed, performance libraries such as Intel's MKL, while performing very well on unicore architectures, see their behaviour degrade when used on multicore systems. Moreover, multicore systems differ widely from one another (presence of shared caches, memory bandwidth, etc.). We propose a systematic method to improve the parallel execution of matrix multiplication, through the study of the behavior of unicore DGEMM kernels in MKL, as well as various other criteria. We show that our fine-tuning can outperform Intel's parallel DGEMM of MKL, with performance gains sometimes up to a factor of two.
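To make the general idea concrete, the sketch below shows one common way a parallel matrix product can be built from a single-core kernel: the result matrix is cut into tiles, and each worker computes one tile with a serial multiply. This is an illustration under stated assumptions, not the method of the paper. The names `serial_kernel`, `tiled_matmul`, `row_blocks`, `col_blocks` and `workers` are hypothetical, and the pure-Python kernel merely stands in for a single-threaded DGEMM call. The tile grid and worker count are the kind of parameters a tuning procedure might explore.

```python
from concurrent.futures import ThreadPoolExecutor


def serial_kernel(a, b, r0, r1, c0, c1):
    """Hypothetical stand-in for a single-threaded DGEMM call.

    Returns the tile C[r0:r1, c0:c1] of the product a @ b as a list of rows.
    """
    inner = len(b)
    return [
        [sum(a[i][p] * b[p][j] for p in range(inner)) for j in range(c0, c1)]
        for i in range(r0, r1)
    ]


def edges(size, parts):
    """Split range(size) into `parts` contiguous, nearly equal chunks."""
    return [size * i // parts for i in range(parts + 1)]


def tiled_matmul(a, b, row_blocks=2, col_blocks=2, workers=4,
                 kernel=serial_kernel):
    """Compute a @ b by assigning each tile of the result to a worker."""
    if any(len(row) != len(b) for row in a):
        raise ValueError("inner dimensions do not match")
    m, n = len(a), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    row_edges = edges(m, row_blocks)
    col_edges = edges(n, col_blocks)

    def task(r0, r1, c0, c1):
        # Each task writes to a disjoint tile of c, so no locking is needed.
        tile = kernel(a, b, r0, r1, c0, c1)
        for offset, tile_row in enumerate(tile):
            c[r0 + offset][c0:c1] = tile_row

    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(task, row_edges[i], row_edges[i + 1],
                        col_edges[j], col_edges[j + 1])
            for i in range(row_blocks)
            for j in range(col_blocks)
        ]
        for future in futures:
            future.result()  # re-raise any exception from a worker
    return c
```

In pure Python the threads give no real speedup because of the interpreter lock; the sketch only shows the decomposition. With a kernel that calls into a native single-threaded BLAS, the same structure lets the tile shape and worker count be varied independently of the kernel.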