Exploiting the Locality Properties of Peano Curves for Parallel Matrix Multiplication

Authors:
Michael Bader
Affiliations:
Institut für Informatik, Technische Universität München, Germany
Venue:
Euro-Par '08 Proceedings of the 14th international Euro-Par conference on Parallel Processing
Year:
2008

Abstract

This work studies how to exploit the locality properties of an inherently cache-efficient matrix multiplication algorithm in a parallel implementation. The algorithm is based on a blockwise element layout and an execution order that are both derived from a Peano space-filling curve. The strong locality of the resulting algorithm motivates a parallel scheme in which each process replicates matrix blocks in a local cache and prefetches remote blocks before they are used. As a consequence, the block size for matrix multiplication and the cache size, and hence the granularity of communication, can be chosen independently. The influence of these parameters on parallel efficiency is studied on a compute cluster with 128 processors. The performance studies show that the size of the local caches has the largest influence on performance, which makes the algorithm an interesting option wherever memory is scarce or where existing cache hierarchies can be exploited (for example, in future manycore environments).
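To illustrate the kind of locality a Peano ordering provides, the following minimal sketch enumerates the blocks of a 3^k x 3^k block grid along a Peano curve. It uses the classical ternary-digit construction of the curve and is not taken from the paper. The function name `peano_index_to_cell` and the parameter `levels` are hypothetical, and the paper's actual block layout and traversal may differ in orientation and detail.

```python
def peano_index_to_cell(t, levels):
    """Map a Peano index t in [0, 9**levels) to (row, col) on a 3**levels grid.

    Classical construction: write t in base 3, take row digits from the even
    positions and column digits from the odd positions, and reflect a digit
    (d -> 2 - d) depending on the parity of the digits of the other
    coordinate that precede it.
    """
    # Extract 2 * levels ternary digits, most significant first.
    digits = []
    for _ in range(2 * levels):
        digits.append(t % 3)
        t //= 3
    digits.reverse()

    row = col = 0
    odd_row_sum = 0  # parity of the raw row digits seen so far
    odd_col_sum = 0  # parity of the raw column digits seen so far
    for i in range(levels):
        r_digit = digits[2 * i]
        c_digit = digits[2 * i + 1]
        # A row digit is reflected if the preceding column digits sum to odd.
        r = 2 - r_digit if odd_col_sum else r_digit
        odd_row_sum ^= r_digit & 1
        # A column digit is reflected if the row digits up to and including
        # the current one sum to odd.
        c = 2 - c_digit if odd_row_sum else c_digit
        odd_col_sum ^= c_digit & 1
        row = 3 * row + r
        col = 3 * col + c
    return row, col


def peano_order(levels):
    """List all cells of the 3**levels grid in Peano order."""
    return [peano_index_to_cell(t, levels) for t in range(9 ** levels)]


if __name__ == "__main__":
    # Consecutive entries are always edge-neighbours in the grid.
    print(peano_order(1))
```

Because consecutive cells in this order always share an edge, a process that walks through blocks in Peano order touches neighbouring blocks in succession, which is the property that makes caching and prefetching of remote blocks attractive.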