Software pipelining: an effective scheduling technique for VLIW machines
PLDI '88: Proceedings of the ACM SIGPLAN 1988 Conference on Programming Language Design and Implementation
The cache performance and optimizations of blocked algorithms
ASPLOS IV: Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems
Design and evaluation of a compiler algorithm for prefetching
ASPLOS V: Proceedings of the Fifth International Conference on Architectural Support for Programming Languages and Operating Systems
Tolerating latency through software-controlled data prefetching
Proceedings of the 1993 ACM/IEEE Conference on Supercomputing
A cellular computer to implement the Kalman filter algorithm
Matrix Multiplication Performance on Commodity Shared-Memory Multiprocessors
PARELEC '04: Proceedings of the International Conference on Parallel Computing in Electrical Engineering
MapReduce: simplified data processing on large clusters
OSDI '04: Proceedings of the 6th Conference on Symposium on Operating Systems Design & Implementation - Volume 6
Evaluating MapReduce for Multi-core and Multiprocessor Systems
HPCA '07: Proceedings of the 2007 IEEE 13th International Symposium on High Performance Computer Architecture
Using OpenMP: Portable Shared Memory Parallel Programming (Scientific and Engineering Computation)
Scheduling dense linear algebra operations on multicore processors
Concurrency and Computation: Practice & Experience
Phoenix rebirth: Scalable MapReduce on a large-scale shared-memory system
IISWC '09: Proceedings of the 2009 IEEE International Symposium on Workload Characterization
Optimizing OpenMP parallelized DGEMM calls on SGI Altix 3700
Euro-Par '06: Proceedings of the 12th International Conference on Parallel Processing
Parallelization of general matrix multiply routines using OpenMP
WOMPAT '04: Proceedings of the 5th International Conference on OpenMP Applications and Tools: Shared Memory Parallel Programming with OpenMP
Parallelism in linear algebra libraries is a common approach to accelerating numerical and scientific applications. Matrix-matrix multiplication is one of the most widely used computations in scientific and numerical algorithms. Although a number of matrix multiplication algorithms exist for distributed-memory environments (e.g., Cannon, Fox, PUMMA, SUMMA), matrix-matrix multiplication algorithms for shared-memory and SMP architectures have not been studied as extensively. In this paper, we present a fast matrix-matrix multiplication algorithm for multi-core and SMP architectures using the MapReduce framework. Memory-resident linear algebra algorithms suffer performance losses on modern multi-core architectures because of the widening performance gap between the CPU and main memory. To allow such compute-intensive algorithms to exploit the full potential of their inherent instruction-level parallelism, the adverse effect of the processor-memory performance gap must be minimized. We therefore also present a cache-sensitive MapReduce matrix multiplication algorithm that fully exploits memory bandwidth and minimizes cache misses and conflicts. Our experimental results show that the two algorithms outperform existing matrix multiplication algorithms for shared-memory architectures, such as those in the Phoenix, PLASMA, and LAPACK libraries.
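The general approach the abstract describes, blocked matrix multiplication expressed as a map phase (per-block partial products) and a reduce phase (summing partials for each output block), can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the function name `mapreduce_matmul`, the block-size parameter `bs`, and the in-memory list of emitted key/value pairs are assumptions made for clarity.

```python
from collections import defaultdict

def mapreduce_matmul(A, B, n, bs):
    """Blocked product C = A x B of two n x n matrices, MapReduce style.
    bs is the block size; blocking keeps working sets cache-resident."""
    # Map phase: for each (i, k, j) block triple, multiply the A(i,k) and
    # B(k,j) blocks and emit the partial result keyed by output block (i, j).
    emitted = []
    for i in range(0, n, bs):
        for j in range(0, n, bs):
            for k in range(0, n, bs):
                part = [[sum(A[x][y] * B[y][z]
                             for y in range(k, min(k + bs, n)))
                         for z in range(j, min(j + bs, n))]
                        for x in range(i, min(i + bs, n))]
                emitted.append(((i, j), part))
    # Reduce phase: group partials by output-block key and sum them into C.
    groups = defaultdict(list)
    for key, part in emitted:
        groups[key].append(part)
    C = [[0] * n for _ in range(n)]
    for (i, j), parts in groups.items():
        for part in parts:
            for di, row in enumerate(part):
                for dj, v in enumerate(row):
                    C[i + di][j + dj] += v
    return C
```

In a real MapReduce runtime the map tasks over block triples run in parallel on separate cores and the framework performs the grouping by key; the sequential loops here only illustrate the decomposition and the cache-friendly block structure.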