Recursion leads to automatic variable blocking for dense linear-algebra algorithms
IBM Journal of Research and Development
The Design and Use of Algorithms for Permuting Large Entries to the Diagonal of Sparse Matrices
SIAM Journal on Matrix Analysis and Applications
Cache oblivious matrix operations using Peano curves
PARA'06 Proceedings of the 8th International Conference on Applied Parallel Computing: State of the Art in Scientific Computing
A cache oblivious algorithm for matrix multiplication based on Peano's space filling curve
PPAM'05 Proceedings of the 6th International Conference on Parallel Processing and Applied Mathematics
Parallel matrix multiplication based on space-filling curves on shared memory multicore platforms
Proceedings of the 2008 Workshop on Memory Access on Future Processors: A Solved Problem?
Exploiting the Locality Properties of Peano Curves for Parallel Matrix Multiplication
Euro-Par '08 Proceedings of the 14th International Euro-Par Conference on Parallel Processing
Towards many-core implementation of LU decomposition using Peano Curves
Proceedings of the combined workshops on UnConventional high performance computing workshop plus memory access workshop
Porting existing cache-oblivious linear algebra HPC modules to Larrabee architecture
Proceedings of the 7th ACM International Conference on Computing Frontiers
Communication-optimal Parallel and Sequential Cholesky Decomposition
SIAM Journal on Scientific Computing
An approach for semiautomatic locality optimizations using OpenMP
PARA'10 Proceedings of the 10th International Conference on Applied Parallel and Scientific Computing - Volume 2
We present hardware-oriented implementations of block-recursive approaches for matrix operations, in particular matrix multiplication and LU decomposition. An element order based on a recursively constructed Peano space-filling curve is used to store the matrix elements. This block-recursive numbering scheme switches to a standard row-major order as soon as the respective matrix subblocks fit into the level-1 cache. For operations on these small blocks, we implemented hardware-oriented kernels optimised for Intel's Core architecture. The resulting matrix-multiplication and LU-decomposition codes compete well with optimised libraries such as Intel's MKL, ATLAS, or GotoBLAS, but have the advantage that only comparatively small and well-defined kernel operations need to be optimised to achieve high performance.
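The core idea of the abstract — recurse on matrix blocks and fall back to a plain row-major kernel once a block is small enough for level-1 cache — can be sketched as follows. Note this is an illustrative simplification, not the paper's implementation: the paper's scheme recurses along a Peano curve with 3x3 subblocks and hand-optimised SIMD kernels, whereas this sketch uses a simpler 2x2 quadrant split and a naive triple-loop base case; all function names and the `cutoff` parameter are hypothetical.

```python
def matmul_kernel(A, B, C, ai, aj, bi, bj, ci, cj, n):
    """Base-case kernel: standard row-major triple loop on an n x n block.

    In the paper this role is played by a hand-optimised,
    L1-cache-resident kernel for Intel's Core architecture.
    (ai, aj) etc. are the top-left corners of the blocks in A, B, C.
    """
    for i in range(n):
        for k in range(n):
            a = A[ai + i][aj + k]
            for j in range(n):
                C[ci + i][cj + j] += a * B[bi + k][bj + j]

def matmul_recursive(A, B, C, ai, aj, bi, bj, ci, cj, n, cutoff=64):
    """Block-recursive C += A * B on n x n blocks (n a power of two).

    Recursion stops (and the row-major kernel takes over) once a block
    is small enough to fit into level-1 cache; `cutoff` is a stand-in
    for that threshold.
    """
    if n <= cutoff:
        matmul_kernel(A, B, C, ai, aj, bi, bj, ci, cj, n)
        return
    h = n // 2
    # C_pq += sum over r of A_pr * B_rq, on the four quadrants.
    for p in range(2):
        for q in range(2):
            for r in range(2):
                matmul_recursive(A, B, C,
                                 ai + p * h, aj + r * h,
                                 bi + r * h, bj + q * h,
                                 ci + p * h, cj + q * h,
                                 h, cutoff)
```

The point of the structure is that all tuning effort concentrates in `matmul_kernel`: the recursion only rearranges the order in which small blocks are visited, which is what lets a Peano-curve traversal improve locality without touching the kernel itself.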