A set of level 3 basic linear algebra subprograms
ACM Transactions on Mathematical Software (TOMS)
Recursion leads to automatic variable blocking for dense linear-algebra algorithms
IBM Journal of Research and Development
Nonlinear array layouts for hierarchical memory systems
ICS '99 Proceedings of the 13th international conference on Supercomputing
LAPACK Working Note 20: A Portable Linear Algebra Library For High-Performance Computers
Anatomy of high-performance matrix multiplication
ACM Transactions on Mathematical Software (TOMS)
Data and thread affinity in OpenMP programs
Proceedings of the 2008 workshop on Memory access on future processors: a solved problem?
Parallel matrix multiplication based on space-filling curves on shared memory multicore platforms
Proceedings of the 2008 workshop on Memory access on future processors: a solved problem?
Hardware-oriented implementation of cache oblivious matrix operations based on space-filling curves
PPAM'07 Proceedings of the 7th international conference on Parallel processing and applied mathematics
Porting existing cache-oblivious linear algebra HPC modules to Larrabee architecture
Proceedings of the 7th ACM international conference on Computing frontiers
We present our recent research on cache-oblivious algorithms and implementations of parallel LU decomposition on shared-memory multi- and manycore platforms. Our approach uses a block-recursive matrix storage scheme based on space-filling curves, and thus extends our work presented at CF'08. The data structure is based on Peano curves and is separated into a coarse-grain recursive block-matrix scheme and a fine-grain iterative order for the elementary matrix blocks. The order of the blocks is derived from the recursive construction of a Peano space-filling curve. The block matrices themselves are stored in ordinary row-major order and form the elementary data types for the block operations. The block size is chosen to fit exactly into the lowest-level data cache of the CPU's cache hierarchy. All matrix operations on this two-level data structure are implemented via routines that take block matrices as operands and are hand-optimised in assembler to exploit the SIMD capabilities of the CPUs. For parallelisation on shared-memory platforms, we compare two different OpenMP implementations: one based on OpenMP 2.0, which requires explicit scheduling of the block operations to processor cores, and one that exploits the new task concept of OpenMP 3.0. Performance tests on various platforms, ranging from desktop systems to an SGI Altix supercomputer, showed that our implementation 'TifaMMy' optimises the use of the available memory hardware by reducing bandwidth requirements. Hence, the cache-oblivious approach of TifaMMy remains efficient in multi- and manycore environments. We also demonstrate that the OpenMP 3.0 task concept leads to both well-structured implementations and competitive parallel efficiency for block-recursive, cache-oblivious algorithms.
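To illustrate the coarse-grain block order the abstract describes, the following sketch generates the sequence in which a Peano space-filling curve visits the blocks of a 3^k x 3^k grid. This is not TifaMMy's actual code; the function name `peano_order` and the recursive mirror-based construction are illustrative. Consecutive blocks in the generated sequence are always grid neighbours, which is the locality property the storage scheme exploits.

```c
#include <stdlib.h>

typedef struct { int x, y; } cell;

/* Fill out[] with the 9^k cells of a 3^k x 3^k grid in Peano order.
 * The recursion places nine transformed copies of the level k-1 curve:
 * macro columns are traversed left to right, macro rows serpentine up
 * and down, and each sub-curve is mirrored in x for odd macro rows and
 * in y for odd macro columns so that consecutive blocks stay adjacent. */
static void peano_order(int k, cell *out) {
    if (k == 0) { out[0].x = 0; out[0].y = 0; return; }
    long sub_n = 1;                                  /* 9^(k-1) cells   */
    for (int i = 0; i < 2 * (k - 1); ++i) sub_n *= 3;
    int m = 1;                                       /* 3^(k-1) side    */
    for (int i = 0; i < k - 1; ++i) m *= 3;
    cell *sub = malloc(sub_n * sizeof *sub);
    peano_order(k - 1, sub);
    long pos = 0;
    for (int i = 0; i < 3; ++i) {                    /* macro column    */
        for (int jj = 0; jj < 3; ++jj) {             /* serpentine row  */
            int j = (i % 2 == 0) ? jj : 2 - jj;
            for (long t = 0; t < sub_n; ++t) {
                int sx = (j % 2 == 1) ? m - 1 - sub[t].x : sub[t].x;
                int sy = (i % 2 == 1) ? m - 1 - sub[t].y : sub[t].y;
                out[pos].x = i * m + sx;
                out[pos].y = j * m + sy;
                ++pos;
            }
        }
    }
    free(sub);
}
```

In a two-level scheme such as the one described, this order would index the cache-sized row-major blocks, not individual matrix elements.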
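The task-parallel, block-recursive style the abstract compares against explicit OpenMP 2.0 scheduling can be sketched as follows. This is a minimal illustration, not TifaMMy's implementation: it shows an unpivoted recursive LU factorisation where the two independent triangular solves are expressed as OpenMP 3.0 tasks. All function names (`lu_recursive`, `trsm_lower`, etc.) are hypothetical, the base-case size is kept tiny for illustration (TifaMMy chooses it to fit the lowest-level cache), and the inner kernels are plain C rather than optimised assembler.

```c
#include <stddef.h>

/* Unblocked LU (no pivoting) on an n x n submatrix, leading dimension ld. */
static void lu_base(double *A, size_t n, size_t ld) {
    for (size_t k = 0; k < n; ++k)
        for (size_t i = k + 1; i < n; ++i) {
            A[i*ld + k] /= A[k*ld + k];
            for (size_t j = k + 1; j < n; ++j)
                A[i*ld + j] -= A[i*ld + k] * A[k*ld + j];
        }
}

/* Solve L * X = B in place; L unit lower triangular (m x m), B is m x n. */
static void trsm_lower(const double *L, double *B, size_t m, size_t n, size_t ld) {
    for (size_t i = 1; i < m; ++i)
        for (size_t k = 0; k < i; ++k)
            for (size_t j = 0; j < n; ++j)
                B[i*ld + j] -= L[i*ld + k] * B[k*ld + j];
}

/* Solve X * U = B in place; U upper triangular (n x n), B is m x n. */
static void trsm_upper(const double *U, double *B, size_t m, size_t n, size_t ld) {
    for (size_t j = 0; j < n; ++j)
        for (size_t i = 0; i < m; ++i) {
            for (size_t k = 0; k < j; ++k)
                B[i*ld + j] -= B[i*ld + k] * U[k*ld + j];
            B[i*ld + j] /= U[j*ld + j];
        }
}

/* C -= A * B; C is m x n, A is m x k, B is k x n, shared leading dim ld. */
static void gemm_sub(const double *A, const double *B, double *C,
                     size_t m, size_t k, size_t n, size_t ld) {
    for (size_t i = 0; i < m; ++i)
        for (size_t p = 0; p < k; ++p)
            for (size_t j = 0; j < n; ++j)
                C[i*ld + j] -= A[i*ld + p] * B[p*ld + j];
}

/* Recursive LU: factor A11, solve the two independent triangular systems
 * as OpenMP tasks, apply the Schur update, then recurse on A22. */
void lu_recursive(double *A, size_t n, size_t ld) {
    if (n <= 2) { lu_base(A, n, ld); return; }  /* tiny base case here   */
    size_t h = n / 2;
    double *A11 = A, *A12 = A + h, *A21 = A + h*ld, *A22 = A + h*ld + h;
    lu_recursive(A11, h, ld);
    #pragma omp task                 /* L11 * U12 = A12 */
    trsm_lower(A11, A12, h, n - h, ld);
    #pragma omp task                 /* L21 * U11 = A21 */
    trsm_upper(A11, A21, n - h, h, ld);
    #pragma omp taskwait
    gemm_sub(A21, A12, A22, n - h, h, n - h, ld);
    lu_recursive(A22, n - h, ld);
}
```

To actually run the tasks in parallel, the top-level call would be wrapped in `#pragma omp parallel` with `#pragma omp single`; without OpenMP the pragmas are ignored and the code runs sequentially.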