Scaling LAPACK panel operations using parallel cache assignment

  • Authors:
  • Anthony M. Castaldo;R. Clint Whaley;Siju Samuel

  • Affiliations:
  • University of Texas at San Antonio, TX;University of Texas at San Antonio, TX;University of Texas at San Antonio, TX

  • Venue:
  • ACM Transactions on Mathematical Software (TOMS)
  • Year:
  • 2013

Abstract

In LAPACK, many matrix operations are cast as block algorithms, which iteratively process a panel using an unblocked algorithm and then update a remainder matrix using the high-performance Level 3 BLAS. The Level 3 BLAS have excellent scaling, but panel processing tends to be bus-bound, and thus scales with bus speed rather than the number of processors (p). Amdahl's law therefore ensures that as p grows, the panel computation will become the dominant cost of these LAPACK routines. Our contribution is a novel parallel cache assignment approach to the panel factorization, which we show scales well with p. We apply this general approach to the QR, QL, RQ, LQ and LU panel factorizations. We show results for two commodity platforms: an 8-core Intel platform and a 32-core AMD platform. For both platforms and all twenty implementations (five factorizations, each of which is available in four types), we present results that demonstrate that our approach yields significant speedup over the existing state of the art.
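
The Amdahl's law argument in the abstract can be made concrete with a small sketch. It is illustrative only and not taken from the paper: the function names are hypothetical, the 5% panel fraction is an assumed value, and the model assumes the panel work does not speed up at all while the Level 3 BLAS update scales perfectly with p.

```python
def amdahl_speedup(serial_fraction: float, p: int) -> float:
    """Overall speedup on p processors when serial_fraction of the
    single-processor run time does not scale (Amdahl's law)."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be in [0, 1]")
    if p < 1:
        raise ValueError("p must be at least 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / p)


def non_scaling_share(serial_fraction: float, p: int) -> float:
    """Fraction of the run time on p processors that is spent in the
    non-scaling part (here standing in for the panel factorization)."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be in [0, 1]")
    if p < 1:
        raise ValueError("p must be at least 1")
    scaled_time = (1.0 - serial_fraction) / p
    return serial_fraction / (serial_fraction + scaled_time)


if __name__ == "__main__":
    panel_fraction = 0.05  # assumed value, chosen only for illustration
    for p in (1, 8, 32):
        speedup = amdahl_speedup(panel_fraction, p)
        share = non_scaling_share(panel_fraction, p)
        print(f"p={p:2d}  speedup={speedup:6.2f}  panel share={share:6.1%}")
```

Under these assumptions, a panel that accounts for 5% of the single-processor time grows to more than half of the run time at p = 32, which is the effect the abstract describes.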