Memory hierarchy optimizations and performance bounds for sparse ATAx

Authors:
Richard Vuduc;Attila Gyulassy;James W. Demmel;Katherine A. Yelick
Affiliations:
University of California, Berkeley;University of California, Berkeley;University of California, Berkeley;University of California, Berkeley
Venue:
ICCS'03 Proceedings of the 2003 international conference on Computational science: PartIII
Year:
2003

Citing 11
Cited 4

Characterizing the behavior of sparse algorithms on caches

Proceedings of the 1992 ACM/IEEE conference on Supercomputing
Adaptive use of iterative methods in interior point methods for linear programming

Adaptive use of iterative methods in interior point methods for linear programming
Optimizing matrix multiply using PHiPAC: a portable, high-performance, ANSI C coding methodology

ICS '97 Proceedings of the 11th international conference on Supercomputing
Applied numerical linear algebra

Applied numerical linear algebra
A relational approach to the automatic generation of sequential sparse matrix codes

A relational approach to the automatic generation of sequential sparse matrix codes
Authoritative sources in a hyperlinked environment

Journal of the ACM (JACM)
Automatic Nonzero Structure Analysis

SIAM Journal on Computing
A scalable cross-platform infrastructure for application performance tuning using hardware counters

Proceedings of the 2000 ACM/IEEE conference on Supercomputing
Automatically tuned linear algebra software

SC '98 Proceedings of the 1998 ACM/IEEE conference on Supercomputing
Optimizing Sparse Matrix Computations for Register Reuse in SPARSITY

ICCS '01 Proceedings of the International Conference on Computational Sciences-Part I
Performance optimizations and bounds for sparse matrix-vector multiply

Proceedings of the 2002 ACM/IEEE conference on Supercomputing

Optimization of sparse matrix-vector multiplication on emerging multicore platforms

Proceedings of the 2007 ACM/IEEE conference on Supercomputing
Optimization of sparse matrix-vector multiplication on emerging multicore platforms

Parallel Computing
Parallel memory prediction for fused linear algebra kernels

ACM SIGMETRICS Performance Evaluation Review - Special issue on the 1st international workshop on performance modeling, benchmarking and simulation of high performance computing systems (PMBS 10)
Performance tuning of matrix triple products based on matrix structure

PARA'04 Proceedings of the 7th international conference on Applied Parallel Computing: state of the Art in Scientific Computing

Quantified Score

Hi-index	0.00

Visualization

Abstract

This paper presents uniprocessor performance optimizations, automatic tuning techniques, and an experimental analysis of the sparse matrix operation, y = AT Ax, where A is a sparse matrix and x, y are dense vectors. We describe an implementation of this computational kernel which brings A through the memory hierarchy only once, and which can be combined naturally with the register blocking optimization previously proposed in the Sparsity tuning system for sparse matrix-vector multiply. We evaluate these optimizations on a benchmark set of 44 matrices and 4 platforms, showing speedups of up to 4.2×. We also develop platform-specific upper-bounds on the performance of these implementations. We analyze how closely we can approach these bounds, and show when low-level tuning techniques (e.g., better instruction scheduling) are likely to yield a significant pay-off. Finally, we propose a hybrid off-line/run-time heuristic which in practice automatically selects nearoptimal values of the key tuning parameters, the register block sizes.