Recursion leads to automatic variable blocking for dense linear-algebra algorithms
IBM Journal of Research and Development
The Design and Use of Algorithms for Permuting Large Entries to the Diagonal of Sparse Matrices
SIAM Journal on Matrix Analysis and Applications
Cache oblivious matrix operations using Peano curves
PARA'06 Proceedings of the 8th International Conference on Applied Parallel Computing: State of the Art in Scientific Computing
A cache oblivious algorithm for matrix multiplication based on Peano's space filling curve
PPAM'05 Proceedings of the 6th International Conference on Parallel Processing and Applied Mathematics
Parallel matrix multiplication based on space-filling curves on shared memory multicore platforms
Proceedings of the 2008 Workshop on Memory Access on Future Processors: A Solved Problem?
Exploiting the Locality Properties of Peano Curves for Parallel Matrix Multiplication
Euro-Par '08 Proceedings of the 14th International Euro-Par Conference on Parallel Processing
Towards many-core implementation of LU decomposition using Peano Curves
Proceedings of the combined workshops on UnConventional high performance computing workshop plus memory access workshop
Porting existing cache-oblivious linear algebra HPC modules to Larrabee architecture
Proceedings of the 7th ACM International Conference on Computing Frontiers
Communication-optimal Parallel and Sequential Cholesky Decomposition
SIAM Journal on Scientific Computing
An approach for semiautomatic locality optimizations using OpenMP
PARA'10 Proceedings of the 10th International Conference on Applied Parallel and Scientific Computing - Volume 2
We present hardware-oriented implementations of block-recursive approaches for matrix operations, in particular matrix multiplication and LU decomposition. An element order based on a recursively constructed Peano space-filling curve is used to store the matrix elements. This block-recursive numbering scheme switches to a standard row-major order as soon as the respective matrix subblocks fit into the level-1 cache. For operations on these small blocks, we implemented hardware-oriented kernels optimised for Intel's Core architecture. The resulting matrix-multiplication and LU-decomposition codes compete well with optimised libraries such as Intel's MKL, ATLAS, or GotoBLAS, but have the advantage that only comparatively small and well-defined kernel operations need to be optimised to achieve high performance.
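The core idea of the abstract — recurse on matrix blocks and fall back to a plain row-major kernel once a block is small enough for level-1 cache — can be sketched as follows. Note this is an illustrative simplification, not the paper's implementation: the paper's scheme recurses along a Peano curve with 3x3 subblocks and hand-optimised SIMD kernels, whereas this sketch uses a simpler 2x2 quadrant split and a naive triple-loop base case; all function names and the `cutoff` parameter are hypothetical.

```python
def matmul_kernel(A, B, C, ai, aj, bi, bj, ci, cj, n):
    """Base-case kernel: standard row-major triple loop on an n x n block.

    In the paper this role is played by a hand-optimised,
    L1-cache-resident kernel for Intel's Core architecture.
    (ai, aj) etc. are the top-left corners of the blocks in A, B, C.
    """
    for i in range(n):
        for k in range(n):
            a = A[ai + i][aj + k]
            for j in range(n):
                C[ci + i][cj + j] += a * B[bi + k][bj + j]

def matmul_recursive(A, B, C, ai, aj, bi, bj, ci, cj, n, cutoff=64):
    """Block-recursive C += A * B on n x n blocks (n a power of two).

    Recursion stops (and the row-major kernel takes over) once a block
    is small enough to fit into level-1 cache; `cutoff` is a stand-in
    for that threshold.
    """
    if n <= cutoff:
        matmul_kernel(A, B, C, ai, aj, bi, bj, ci, cj, n)
        return
    h = n // 2
    # C_pq += sum over r of A_pr * B_rq, on the four quadrants.
    for p in range(2):
        for q in range(2):
            for r in range(2):
                matmul_recursive(A, B, C,
                                 ai + p * h, aj + r * h,
                                 bi + r * h, bj + q * h,
                                 ci + p * h, cj + q * h,
                                 h, cutoff)
```

The point of the structure is that all tuning effort concentrates in `matmul_kernel`: the recursion only rearranges the order in which small blocks are visited, which is what lets a Peano-curve traversal improve locality without touching the kernel itself.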