A set of level 3 basic linear algebra subprograms
ACM Transactions on Mathematical Software (TOMS)
Recursion leads to automatic variable blocking for dense linear-algebra algorithms
IBM Journal of Research and Development
Nonlinear array layouts for hierarchical memory systems
ICS '99 Proceedings of the 13th international conference on Supercomputing
LAPACK Working Note 20: A Portable Linear Algebra Library For High-Performance Computers
Anatomy of high-performance matrix multiplication
ACM Transactions on Mathematical Software (TOMS)
Data and thread affinity in OpenMP programs
Proceedings of the 2008 workshop on Memory access on future processors: a solved problem?
Parallel matrix multiplication based on space-filling curves on shared memory multicore platforms
Proceedings of the 2008 workshop on Memory access on future processors: a solved problem?
Hardware-oriented implementation of cache oblivious matrix operations based on space-filling curves
PPAM'07 Proceedings of the 7th international conference on Parallel processing and applied mathematics
Porting existing cache-oblivious linear algebra HPC modules to Larrabee architecture
Proceedings of the 7th ACM international conference on Computing frontiers
We present our recent research on cache-oblivious algorithms and implementations of parallel LU decomposition on shared-memory multi- and manycore platforms. Our approach uses a block-recursive matrix storage scheme based on space-filling curves, and thus extends our work presented at CF'08. The data structure is based on Peano curves and is separated into a coarse-grain recursive block-matrix scheme and a fine-grain iterative order for the elementary matrix blocks. The order of the blocks is derived from the recursive construction of a Peano space-filling curve. The block matrices themselves are stored in ordinary row-major order and form the elementary data types for the block operations. The block size is chosen to fit exactly into the lowest-level data cache of the CPU's cache hierarchy. All matrix operations on this two-level data structure are implemented via routines that take block matrices as operands and are hand-optimised in assembler to exploit the SIMD capabilities of the CPUs. For parallelisation on shared-memory platforms, we compare two different OpenMP implementations: one based on OpenMP 2.0, which requires explicit scheduling of the block operations to processor cores, and one that exploits the new task concept of OpenMP 3.0. Performance tests on various platforms, ranging from desktop systems to an SGI Altix supercomputer, showed that our implementation 'TifaMMy' optimises the use of the available memory hardware by reducing bandwidth requirements. Hence, the cache-oblivious approach of TifaMMy remains efficient in multi- and manycore environments. We also demonstrate that the OpenMP 3.0 task concept leads to both well-structured implementations and competitive parallel efficiency for block-recursive, cache-oblivious algorithms.
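To illustrate the coarse-grain block order the abstract describes, the following sketch generates the sequence in which a Peano space-filling curve visits the blocks of a 3^k x 3^k grid. This is not TifaMMy's actual code; the function name `peano_order` and the recursive mirror-based construction are illustrative. Consecutive blocks in the generated sequence are always grid neighbours, which is the locality property the storage scheme exploits.

```c
#include <stdlib.h>

typedef struct { int x, y; } cell;

/* Fill out[] with the 9^k cells of a 3^k x 3^k grid in Peano order.
 * The recursion places nine transformed copies of the level k-1 curve:
 * macro columns are traversed left to right, macro rows serpentine up
 * and down, and each sub-curve is mirrored in x for odd macro rows and
 * in y for odd macro columns so that consecutive blocks stay adjacent. */
static void peano_order(int k, cell *out) {
    if (k == 0) { out[0].x = 0; out[0].y = 0; return; }
    long sub_n = 1;                                  /* 9^(k-1) cells   */
    for (int i = 0; i < 2 * (k - 1); ++i) sub_n *= 3;
    int m = 1;                                       /* 3^(k-1) side    */
    for (int i = 0; i < k - 1; ++i) m *= 3;
    cell *sub = malloc(sub_n * sizeof *sub);
    peano_order(k - 1, sub);
    long pos = 0;
    for (int i = 0; i < 3; ++i) {                    /* macro column    */
        for (int jj = 0; jj < 3; ++jj) {             /* serpentine row  */
            int j = (i % 2 == 0) ? jj : 2 - jj;
            for (long t = 0; t < sub_n; ++t) {
                int sx = (j % 2 == 1) ? m - 1 - sub[t].x : sub[t].x;
                int sy = (i % 2 == 1) ? m - 1 - sub[t].y : sub[t].y;
                out[pos].x = i * m + sx;
                out[pos].y = j * m + sy;
                ++pos;
            }
        }
    }
    free(sub);
}
```

In a two-level scheme such as the one described, this order would index the cache-sized row-major blocks, not individual matrix elements.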
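The task-parallel, block-recursive style the abstract compares against explicit OpenMP 2.0 scheduling can be sketched as follows. This is a minimal illustration, not TifaMMy's implementation: it shows an unpivoted recursive LU factorisation where the two independent triangular solves are expressed as OpenMP 3.0 tasks. All function names (`lu_recursive`, `trsm_lower`, etc.) are hypothetical, the base-case size is kept tiny for illustration (TifaMMy chooses it to fit the lowest-level cache), and the inner kernels are plain C rather than optimised assembler.

```c
#include <stddef.h>

/* Unblocked LU (no pivoting) on an n x n submatrix, leading dimension ld. */
static void lu_base(double *A, size_t n, size_t ld) {
    for (size_t k = 0; k < n; ++k)
        for (size_t i = k + 1; i < n; ++i) {
            A[i*ld + k] /= A[k*ld + k];
            for (size_t j = k + 1; j < n; ++j)
                A[i*ld + j] -= A[i*ld + k] * A[k*ld + j];
        }
}

/* Solve L * X = B in place; L unit lower triangular (m x m), B is m x n. */
static void trsm_lower(const double *L, double *B, size_t m, size_t n, size_t ld) {
    for (size_t i = 1; i < m; ++i)
        for (size_t k = 0; k < i; ++k)
            for (size_t j = 0; j < n; ++j)
                B[i*ld + j] -= L[i*ld + k] * B[k*ld + j];
}

/* Solve X * U = B in place; U upper triangular (n x n), B is m x n. */
static void trsm_upper(const double *U, double *B, size_t m, size_t n, size_t ld) {
    for (size_t j = 0; j < n; ++j)
        for (size_t i = 0; i < m; ++i) {
            for (size_t k = 0; k < j; ++k)
                B[i*ld + j] -= B[i*ld + k] * U[k*ld + j];
            B[i*ld + j] /= U[j*ld + j];
        }
}

/* C -= A * B; C is m x n, A is m x k, B is k x n, shared leading dim ld. */
static void gemm_sub(const double *A, const double *B, double *C,
                     size_t m, size_t k, size_t n, size_t ld) {
    for (size_t i = 0; i < m; ++i)
        for (size_t p = 0; p < k; ++p)
            for (size_t j = 0; j < n; ++j)
                C[i*ld + j] -= A[i*ld + p] * B[p*ld + j];
}

/* Recursive LU: factor A11, solve the two independent triangular systems
 * as OpenMP tasks, apply the Schur update, then recurse on A22. */
void lu_recursive(double *A, size_t n, size_t ld) {
    if (n <= 2) { lu_base(A, n, ld); return; }  /* tiny base case here   */
    size_t h = n / 2;
    double *A11 = A, *A12 = A + h, *A21 = A + h*ld, *A22 = A + h*ld + h;
    lu_recursive(A11, h, ld);
    #pragma omp task                 /* L11 * U12 = A12 */
    trsm_lower(A11, A12, h, n - h, ld);
    #pragma omp task                 /* L21 * U11 = A21 */
    trsm_upper(A11, A21, n - h, h, ld);
    #pragma omp taskwait
    gemm_sub(A21, A12, A22, n - h, h, n - h, ld);
    lu_recursive(A22, n - h, ld);
}
```

To actually run the tasks in parallel, the top-level call would be wrapped in `#pragma omp parallel` with `#pragma omp single`; without OpenMP the pragmas are ignored and the code runs sequentially.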