Memory hierarchy optimizations and performance bounds for sparse ATAx

  • Authors:
  • Richard Vuduc;Attila Gyulassy;James W. Demmel;Katherine A. Yelick

  • Affiliations:
  • University of California, Berkeley;University of California, Berkeley;University of California, Berkeley;University of California, Berkeley

  • Venue:
  • ICCS'03 Proceedings of the 2003 international conference on Computational science: PartIII
  • Year:
  • 2003

Quantified Score

Hi-index 0.00

Visualization

Abstract

This paper presents uniprocessor performance optimizations, automatic tuning techniques, and an experimental analysis of the sparse matrix operation, y = AT Ax, where A is a sparse matrix and x, y are dense vectors. We describe an implementation of this computational kernel which brings A through the memory hierarchy only once, and which can be combined naturally with the register blocking optimization previously proposed in the Sparsity tuning system for sparse matrix-vector multiply. We evaluate these optimizations on a benchmark set of 44 matrices and 4 platforms, showing speedups of up to 4.2×. We also develop platform-specific upper-bounds on the performance of these implementations. We analyze how closely we can approach these bounds, and show when low-level tuning techniques (e.g., better instruction scheduling) are likely to yield a significant pay-off. Finally, we propose a hybrid off-line/run-time heuristic which in practice automatically selects nearoptimal values of the key tuning parameters, the register block sizes.