Mapping the LU decomposition on a many-core architecture: challenges and solutions

  • Authors:
  • Ioannis E. Venetis; Guang R. Gao

  • Affiliations:
  • University of Patras, Patras, Greece; University of Delaware, Newark, DE, USA

  • Venue:
  • Proceedings of the 6th ACM conference on Computing frontiers
  • Year:
  • 2009

Abstract

Recently, multi-core architectures with alternative memory subsystem designs have emerged. Instead of using hardware-managed cache hierarchies, they employ software-managed embedded memory. An open question is which programming and compiling methods effectively exploit the performance potential of this new class of architectures. Using LU decomposition as a case study, we propose three techniques that, combined, achieve a 27-fold speedup on the IBM Cyclops-64 many-core architecture over the parallel LU implementation from the SPLASH-2 benchmark suite. First, an adaptive load distribution method maximizes load balance among cores; this is important for leveraging the potential of the other two methods. Second, a register tiling method determines the optimal data tile parameters and maximizes data reuse under register file size constraints. We demonstrate that this method is inherently general and should be applicable well beyond Cyclops-64. Third, we present a register allocation method for register-tiled loop bodies. We evaluate its effect through hand-tuned Cyclops-64 assembly code and observe a 6-fold reduction in load/store operations. We achieve 19.17 and 27.50 GFlops with double-precision floating-point numbers for a 700x700 and a 1000x1000 matrix, respectively.
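
The general idea behind register tiling can be illustrated with a small sketch. It is not the authors' method or their Cyclops-64 code: the abstract does not give the algorithm, so the function names (`update_naive`, `update_tiled`), the fixed 2x2 default tile, and the simple memory-operation counting model are all assumptions made for illustration. The kernel shown is the matrix update C -= A·B, which dominates blocked LU decomposition; the tiled version holds a small tile of C in local variables (standing in for registers) so that each loaded element of A and B is reused several times.

```python
def update_naive(C, A, B):
    """Compute C -= A @ B in place, one element at a time.

    Returns the number of memory operations under an assumed model in
    which every multiply-add loads C, A, and B and stores C (4 ops).
    """
    n, m, p = len(C), len(C[0]), len(B)
    mem_ops = 0
    for i in range(n):
        for j in range(m):
            for k in range(p):
                C[i][j] -= A[i][k] * B[k][j]
                mem_ops += 4
    return mem_ops


def update_tiled(C, A, B, tr=2, tc=2):
    """Compute C -= A @ B in place using a tr x tc register tile.

    Hypothetical sketch: assumes the dimensions of C are multiples of
    the tile size. Returns the memory-operation count under the same
    assumed model, where tile values held in locals cost nothing.
    """
    n, m, p = len(C), len(C[0]), len(B)
    assert n % tr == 0 and m % tc == 0, "tile must divide the matrix"
    mem_ops = 0
    for i0 in range(0, n, tr):
        for j0 in range(0, m, tc):
            # Load the tile of C once into "registers".
            acc = [[C[i0 + r][j0 + c] for c in range(tc)] for r in range(tr)]
            mem_ops += tr * tc
            for k in range(p):
                # Load tr elements of A and tc elements of B, then reuse
                # each of them across the whole tile.
                a = [A[i0 + r][k] for r in range(tr)]
                b = [B[k][j0 + c] for c in range(tc)]
                mem_ops += tr + tc
                for r in range(tr):
                    for c in range(tc):
                        acc[r][c] -= a[r] * b[c]
            # Store the finished tile back once.
            for r in range(tr):
                for c in range(tc):
                    C[i0 + r][j0 + c] = acc[r][c]
            mem_ops += tr * tc
    return mem_ops
```

Under this counting model, a 4x4 update with an inner dimension of 4 costs 256 memory operations in the naive form and 96 with 2x2 tiles. Larger tiles reduce the count further but need more registers, which is the trade-off the abstract refers to when it speaks of choosing tile parameters under register file size constraints. The figures from this toy model are not comparable to the 6-fold reduction reported for the hand-tuned assembly.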