Solving dense linear systems on platforms with multiple hardware accelerators

  • Authors:
  • Gregorio Quintana-Ortí; Francisco D. Igual; Enrique S. Quintana-Ortí; Robert A. van de Geijn

  • Affiliations:
  • Universidad Jaime I, Castellón, Spain (Quintana-Ortí, Igual, Quintana-Ortí); The University of Texas at Austin, Austin, TX, USA (van de Geijn)

  • Venue:
  • Proceedings of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
  • Year:
  • 2009

Abstract

In a previous PPoPP paper we showed how the FLAME methodology, combined with the SuperMatrix runtime system, yields a simple yet powerful solution for programming dense linear algebra operations on multicore platforms. In this paper we provide further evidence that this approach solves the programmability problem for this domain by targeting a more complex architecture, composed of a multicore processor and multiple hardware accelerators (GPUs, Cell B.E., etc.), each with its own local memory, resulting in a platform more reminiscent of a heterogeneous distributed-memory system. In particular, we show that the FLAME programming model accommodates this new situation effortlessly, so that no significant change needs to be made to the codebase. All complexity is hidden inside the SuperMatrix runtime scheduling mechanism, which incorporates software implementations of standard cache/memory coherence techniques from computer architecture to improve performance. Our experimental evaluation on an Intel Xeon 8-core host linked to an NVIDIA Tesla S870 platform with four GPUs delivers peak performance of around 550 and 450 (single-precision) GFLOPS for the matrix-matrix product and the Cholesky factorization, respectively, which we believe to be the best performance numbers posted on this new architecture for such operations.
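To make the abstract's mention of the Cholesky factorization concrete, the sketch below shows a standard tiled right-looking Cholesky in NumPy. Each tile operation (the diagonal factorization, triangular solve, and trailing update, corresponding to the LAPACK/BLAS kernels POTRF, TRSM, and SYRK/GEMM) is the kind of task that a runtime such as SuperMatrix would extract, track dependencies between, and schedule across the accelerators. This is only an illustrative sequential sketch, not the FLAME/SuperMatrix code itself; the function name and tile-size parameter are our own.

```python
import numpy as np

def tiled_cholesky(A, b):
    """Lower-triangular Cholesky factor of SPD matrix A, computed tile by
    tile with tile size b. Each tile operation below maps to one task in a
    task-parallel runtime (illustrative only; not the paper's actual code)."""
    n = A.shape[0]
    for k in range(0, n, b):
        ke = min(k + b, n)
        # POTRF-like task: factor the diagonal tile A_kk = L_kk L_kk^T
        A[k:ke, k:ke] = np.linalg.cholesky(A[k:ke, k:ke])
        Lkk = A[k:ke, k:ke]
        for i in range(ke, n, b):
            ie = min(i + b, n)
            # TRSM-like task: A_ik <- A_ik L_kk^{-T}
            A[i:ie, k:ke] = np.linalg.solve(Lkk, A[i:ie, k:ke].T).T
        for i in range(ke, n, b):
            ie = min(i + b, n)
            for j in range(ke, i + 1, b):
                je = min(j + b, n)
                # SYRK (i == j) / GEMM (i > j) task: trailing-submatrix update
                A[i:ie, j:je] -= A[i:ie, k:ke] @ A[j:je, k:ke].T
    return np.tril(A)
```

In the runtime setting described in the abstract, the tiles would reside in the separate local memories of the accelerators, and the software cache-coherence layer would decide when a tile must be copied or invalidated before one of these tasks runs.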