Software pipelining: an effective scheduling technique for VLIW machines
PLDI '88: Proceedings of the ACM SIGPLAN 1988 Conference on Programming Language Design and Implementation
The cache performance and optimizations of blocked algorithms
ASPLOS IV: Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems
Design and evaluation of a compiler algorithm for prefetching
ASPLOS V: Proceedings of the Fifth International Conference on Architectural Support for Programming Languages and Operating Systems
Tolerating latency through software-controlled data prefetching
Proceedings of the 1993 ACM/IEEE Conference on Supercomputing
A cellular computer to implement the Kalman filter algorithm
Matrix Multiplication Performance on Commodity Shared-Memory Multiprocessors
PARELEC '04: Proceedings of the International Conference on Parallel Computing in Electrical Engineering
MapReduce: simplified data processing on large clusters
OSDI '04: Proceedings of the 6th Conference on Symposium on Operating Systems Design & Implementation - Volume 6
Evaluating MapReduce for Multi-core and Multiprocessor Systems
HPCA '07: Proceedings of the 2007 IEEE 13th International Symposium on High Performance Computer Architecture
Using OpenMP: Portable Shared Memory Parallel Programming (Scientific and Engineering Computation)
Scheduling dense linear algebra operations on multicore processors
Concurrency and Computation: Practice & Experience
Phoenix rebirth: Scalable MapReduce on a large-scale shared-memory system
IISWC '09: Proceedings of the 2009 IEEE International Symposium on Workload Characterization
Optimizing OpenMP parallelized DGEMM calls on SGI Altix 3700
Euro-Par '06: Proceedings of the 12th International Conference on Parallel Processing
Parallelization of general matrix multiply routines using OpenMP
WOMPAT '04: Proceedings of the 5th International Conference on OpenMP Applications and Tools: Shared Memory Parallel Programming with OpenMP
Parallelism in linear algebra libraries is a common approach to accelerating numerical and scientific applications. Matrix-matrix multiplication is one of the most widely used computations in scientific and numerical algorithms. Although a number of matrix multiplication algorithms exist for distributed-memory environments (e.g., Cannon, Fox, PUMMA, SUMMA), matrix-matrix multiplication algorithms for shared-memory and SMP architectures have not been studied as extensively. In this paper, we present a fast matrix-matrix multiplication algorithm for multi-core and SMP architectures using the MapReduce framework. Memory-resident linear algebra algorithms suffer performance losses on modern multi-core architectures because of the widening performance gap between the CPU and main memory. To allow such compute-intensive algorithms to exploit the full potential of their inherent instruction-level parallelism, the adverse effect of the processor-memory performance gap must be minimized. We therefore also present a cache-sensitive MapReduce matrix multiplication algorithm that fully exploits memory bandwidth and minimizes cache misses and conflicts. Our experimental results show that the two algorithms outperform existing matrix multiplication algorithms for shared-memory architectures, such as those in the Phoenix, PLASMA, and LAPACK libraries.
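The general approach the abstract describes, blocked matrix multiplication expressed as a map phase (per-block partial products) and a reduce phase (summing partials for each output block), can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the function name `mapreduce_matmul`, the block-size parameter `bs`, and the in-memory list of emitted key/value pairs are assumptions made for clarity.

```python
from collections import defaultdict

def mapreduce_matmul(A, B, n, bs):
    """Blocked product C = A x B of two n x n matrices, MapReduce style.
    bs is the block size; blocking keeps working sets cache-resident."""
    # Map phase: for each (i, k, j) block triple, multiply the A(i,k) and
    # B(k,j) blocks and emit the partial result keyed by output block (i, j).
    emitted = []
    for i in range(0, n, bs):
        for j in range(0, n, bs):
            for k in range(0, n, bs):
                part = [[sum(A[x][y] * B[y][z]
                             for y in range(k, min(k + bs, n)))
                         for z in range(j, min(j + bs, n))]
                        for x in range(i, min(i + bs, n))]
                emitted.append(((i, j), part))
    # Reduce phase: group partials by output-block key and sum them into C.
    groups = defaultdict(list)
    for key, part in emitted:
        groups[key].append(part)
    C = [[0] * n for _ in range(n)]
    for (i, j), parts in groups.items():
        for part in parts:
            for di, row in enumerate(part):
                for dj, v in enumerate(row):
                    C[i + di][j + dj] += v
    return C
```

In a real MapReduce runtime the map tasks over block triples run in parallel on separate cores and the framework performs the grouping by key; the sequential loops here only illustrate the decomposition and the cache-friendly block structure.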