Automatic performance tuning of sparse matrix kernels

  • Authors:
  • Richard Wilson Vuduc; James W. Demmel

  • Venue:
  • PhD dissertation, University of California, Berkeley
  • Year:
  • 2003

Abstract

This dissertation presents an automated system for generating highly efficient, platform-adapted implementations of sparse matrix kernels. We show that conventional implementations of important sparse kernels such as sparse matrix-vector multiply (SpMV) have historically run at 10% or less of peak machine speed on cache-based superscalar architectures. Our implementations of SpMV, automatically tuned using a methodology based on empirical search, can by contrast achieve up to 31% of peak machine speed and can be up to 4× faster than conventional implementations. Given a matrix, kernel, and machine, our approach to selecting a fast implementation consists of two steps: (1) we identify and generate a space of reasonable implementations, and then (2) we search this space for the fastest one using a combination of heuristic models and actual experiments (i.e., running and timing the code). We build on the SPARSITY system for generating highly tuned implementations of the SpMV kernel y ← y + Ax, where A is a sparse matrix and x, y are dense vectors. We extend SPARSITY to support tuning for a variety of common non-zero patterns arising in practice, and for additional kernels such as sparse triangular solve (SpTS) and the computation of A^T A·x (or A A^T·x) and A^ρ·x. We develop new models to compute, for particular data structures and kernels, the best absolute performance (e.g., in Mflop/s) we might expect on a given matrix and machine. These performance upper bounds account for the cost of memory operations at all levels of the memory hierarchy, but assume ideal instruction scheduling and low-level tuning. We evaluate the performance of our generated and tuned implementations against these bounds, and find that SpMV and SpTS achieve up to 75% of the performance bound. This finding places limits on the effectiveness of additional low-level tuning (e.g., better instruction selection and scheduling). (Abstract shortened by UMI.)
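
For reference, the SpMV kernel y ← y + Ax, with A stored in the widely used compressed sparse row (CSR) format, is conventionally written as a doubly nested loop like the C sketch below; the function and array names (ptr, ind, val) are illustrative, not taken from the dissertation.

    /* Reference CSR SpMV: y <- y + A*x.
     * A has m rows; ptr[i]..ptr[i+1]-1 indexes the nonzeros of row i,
     * ind[] holds their column indices and val[] their values.
     * Names are illustrative only. */
    void spmv_csr(int m, const int *ptr, const int *ind, const double *val,
                  const double *x, double *y)
    {
        for (int i = 0; i < m; i++) {
            double yi = y[i];
            for (int k = ptr[i]; k < ptr[i+1]; k++)
                yi += val[k] * x[ind[k]];
            y[i] = yi;
        }
    }

The irregular, indirect access to x and the low ratio of flops to memory operations in this loop are what keep untuned SpMV at a small fraction of machine peak.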
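One important axis of the implementation space explored in this line of work is register blocking, in which A is stored in an r×c block compressed sparse row (BCSR) format so that each block's contribution can be accumulated in registers. The following is a minimal sketch of a 2×2 variant, assuming the matrix has already been converted to 2×2 blocks with explicit zero fill where needed; the function and array names are illustrative only.

    /* Sketch of a 2x2 register-blocked (BCSR) SpMV: y <- y + A*x.
     * brow counts block rows; bptr[I]..bptr[I+1]-1 indexes the 2x2 blocks of
     * block row I, bind[] holds block-column indices, and bval[] stores each
     * block's 4 entries in row-major order. Names are illustrative only. */
    void spmv_bcsr_2x2(int brow, const int *bptr, const int *bind,
                       const double *bval, const double *x, double *y)
    {
        for (int I = 0; I < brow; I++) {
            double y0 = y[2*I], y1 = y[2*I + 1];
            for (int k = bptr[I]; k < bptr[I+1]; k++) {
                const double *b  = &bval[4*k];
                double x0 = x[2*bind[k]], x1 = x[2*bind[k] + 1];
                y0 += b[0]*x0 + b[1]*x1;
                y1 += b[2]*x0 + b[3]*x1;
            }
            y[2*I]     = y0;
            y[2*I + 1] = y1;
        }
    }

Because each block reuses x0 and x1 across two rows and keeps the partial sums y0, y1 in registers, the inner loop performs more flops per load than the CSR loop above; the tuning system's task is to decide, per matrix and machine, for which block size r×c this trade-off (including the cost of explicit zero fill) actually pays off.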
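The performance upper bounds mentioned in the abstract charge each memory operation the access latency of the memory-hierarchy level that services it and treat arithmetic as free (ideal scheduling). The sketch below shows one way such a bound might be evaluated; the function name, parameters, and the incremental-latency accounting are illustrative assumptions, with the load and miss counts supplied by a separate model rather than measured.

    /* Sketch of a latency-based upper bound on SpMV performance (Mflop/s).
     *   flops     : floating-point operations (2*nnz for y <- y + A*x)
     *   loads     : total load operations
     *   nlevels   : number of cache levels
     *   alpha[i]  : access latency in cycles of level i; alpha[nlevels] is memory
     *   misses[i] : estimated misses at cache level i, i = 0..nlevels-1
     *   clock_hz  : processor clock rate in Hz
     * All inputs are modeling assumptions, not the dissertation's parameters. */
    double mflops_upper_bound(double flops, double loads, int nlevels,
                              const double *alpha, const double *misses,
                              double clock_hz)
    {
        /* Every load costs at least an L1 hit. */
        double cycles = loads * alpha[0];
        /* A miss at level i additionally pays the extra latency of level i+1. */
        for (int i = 0; i < nlevels; i++)
            cycles += misses[i] * (alpha[i + 1] - alpha[i]);
        return (flops / (cycles / clock_hz)) * 1e-6;  /* Mflop/s */
    }

Comparing measured Mflop/s of a tuned implementation against a bound of this kind is what yields the "up to 75% of the performance bound" figure cited in the abstract.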