Towards many-core implementation of LU decomposition using Peano Curves

  • Authors: Alexander Heinecke; Michael Bader

  • Affiliations: Technische Universität München, München, Germany (both authors)

  • Venue: Proceedings of the combined workshops on UnConventional high performance computing workshop plus memory access workshop
  • Year: 2009

Abstract

We present our recent research on cache-oblivious algorithms and implementations of parallel LU decomposition on shared-memory multi- and manycore platforms. Our approach uses a block-recursive matrix storage scheme based on space-filling curves, and thus extends our work presented at CF'08. The data structure is based on Peano curves and is separated into a coarse-grain recursive block-matrix scheme and a fine-grain iterative order for the elementary matrix blocks. The block element order is derived from the recursive construction of a Peano space-filling curve. The block matrices are stored in ordinary row-major order and form the elementary data types for the block operations. The block size is chosen to fit the lowest-level data cache in the CPU's cache hierarchy. All matrix operations on this two-level data structure are implemented via routines that take block matrices as operands and are optimised in assembler to exploit the SIMD capabilities of the CPUs. For parallelisation on shared-memory platforms, we compare two OpenMP implementations: one based on OpenMP 2.0, which requires explicit scheduling of the block operations to processor cores, and one that exploits the new task concept in OpenMP 3.0. Performance tests on various platforms, ranging from desktop systems to an SGI Altix supercomputer, showed that our implementation 'TifaMMy' optimises the use of the available memory hardware by reducing the bandwidth requirements. Hence, the cache-oblivious approach of TifaMMy is also efficient in multi- and manycore environments. We also demonstrated that the OpenMP 3.0 task concept can lead to both well-structured implementations and competitive parallel efficiency for block-recursive, cache-oblivious algorithms.
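
The two-level layout described in the abstract can be illustrated with a small Python sketch. It is not taken from TifaMMy: the function names (`peano_block_coords`, `build_block_order`, `element_offset`), the choice of the first coordinate as the block row, and the restriction to a grid of 3**levels by 3**levels blocks are all assumptions made for illustration. The sketch enumerates blocks along a Peano curve using Peano's base-3 digit construction, then locates an element inside its block in row-major order.

```python
def peano_block_coords(n, levels):
    """Map position n along a Peano curve to (x, y) on a 3**levels grid.

    Uses Peano's base-3 construction: the digits of n alternate between
    the two coordinates, and a digit t is reflected to 2 - t whenever the
    sum of the earlier digits of the other coordinate is odd.
    """
    digits = []
    for _ in range(2 * levels):
        digits.append(n % 3)
        n //= 3
    digits.reverse()  # most significant digit first

    x = y = 0
    x_sum = y_sum = 0  # running sums of the raw digits assigned to x and y
    for i, t in enumerate(digits):
        if i % 2 == 0:
            d = 2 - t if y_sum % 2 else t
            x = 3 * x + d
            x_sum += t
        else:
            d = 2 - t if x_sum % 2 else t
            y = 3 * y + d
            y_sum += t
    return x, y


def build_block_order(levels):
    """Return a dict mapping block coordinates to their position on the curve."""
    size = 3 ** levels
    return {peano_block_coords(n, levels): n for n in range(size * size)}


def element_offset(order, i, j, block):
    """Linear offset of matrix element (i, j) in the two-level layout.

    Assumption: blocks are laid out one after another in Peano order, and
    each block of block x block elements is stored in row-major order.
    """
    bi, ii = divmod(i, block)
    bj, jj = divmod(j, block)
    return order[(bi, bj)] * block * block + ii * block + jj


if __name__ == "__main__":
    order = build_block_order(1)
    # Print the 3 x 3 block grid with each block's position on the curve.
    for bi in range(3):
        print([order[(bi, bj)] for bj in range(3)])
```

Consecutive blocks on the curve are always direct neighbours in the grid, which is the locality property a cache-oblivious block scheme of this kind relies on. A real implementation would compute the traversal recursively rather than through a lookup table.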