Automatic performance tuning of sparse matrix kernels

  • Authors:
  • Richard Wilson Vuduc; James W. Demmel

  • Venue:
  • PhD dissertation, University of California, Berkeley
  • Year:
  • 2003

Abstract

This dissertation presents an automated system for generating highly efficient, platform-adapted implementations of sparse matrix kernels. We show that conventional implementations of important sparse kernels such as sparse matrix-vector multiply (SpMV) have historically run at 10% or less of peak machine speed on cache-based superscalar architectures. Our implementations of SpMV, automatically tuned using a methodology based on empirical search, can by contrast achieve up to 31% of peak machine speed and can be up to 4× faster than conventional implementations. Given a matrix, kernel, and machine, our approach to selecting a fast implementation consists of two steps: (1) we identify and generate a space of reasonable implementations, and then (2) we search this space for the fastest one using a combination of heuristic models and actual experiments (i.e., running and timing the code). We build on the SPARSITY system for generating highly tuned implementations of the SpMV kernel y ← y + Ax, where A is a sparse matrix and x, y are dense vectors. We extend SPARSITY to support tuning for a variety of common non-zero patterns arising in practice, and for additional kernels such as sparse triangular solve (SpTS) and the computation of A^T A·x (or A A^T·x) and A^ρ·x. We develop new models to compute, for particular data structures and kernels, the best absolute performance (e.g., in Mflop/s) we might expect on a given matrix and machine. These performance upper bounds account for the cost of memory operations at all levels of the memory hierarchy, but assume ideal instruction scheduling and low-level tuning. We evaluate the performance of our generated and tuned implementations against these bounds, and find that SpMV and SpTS achieve up to 75% of the performance bound. This finding places limits on the effectiveness of additional low-level tuning (e.g., better instruction selection and scheduling). (Abstract shortened by UMI.)
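
For reference, the SpMV kernel y ← y + Ax, with A stored in the widely used compressed sparse row (CSR) format, is conventionally written as a doubly nested loop like the C sketch below; the function and array names (ptr, ind, val) are illustrative, not taken from the dissertation.

    /* Reference CSR SpMV: y <- y + A*x.
     * A has m rows; ptr[i]..ptr[i+1]-1 indexes the nonzeros of row i,
     * ind[] holds their column indices and val[] their values.
     * Names are illustrative only. */
    void spmv_csr(int m, const int *ptr, const int *ind, const double *val,
                  const double *x, double *y)
    {
        for (int i = 0; i < m; i++) {
            double yi = y[i];
            for (int k = ptr[i]; k < ptr[i+1]; k++)
                yi += val[k] * x[ind[k]];
            y[i] = yi;
        }
    }

The irregular, indirect access to x and the low ratio of flops to memory operations in this loop are what keep untuned SpMV at a small fraction of machine peak.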
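One important axis of the implementation space explored in this line of work is register blocking, in which A is stored in an r×c block compressed sparse row (BCSR) format so that each block's contribution can be accumulated in registers. The following is a minimal sketch of a 2×2 variant, assuming the matrix has already been converted to 2×2 blocks with explicit zero fill where needed; the function and array names are illustrative only.

    /* Sketch of a 2x2 register-blocked (BCSR) SpMV: y <- y + A*x.
     * brow counts block rows; bptr[I]..bptr[I+1]-1 indexes the 2x2 blocks of
     * block row I, bind[] holds block-column indices, and bval[] stores each
     * block's 4 entries in row-major order. Names are illustrative only. */
    void spmv_bcsr_2x2(int brow, const int *bptr, const int *bind,
                       const double *bval, const double *x, double *y)
    {
        for (int I = 0; I < brow; I++) {
            double y0 = y[2*I], y1 = y[2*I + 1];
            for (int k = bptr[I]; k < bptr[I+1]; k++) {
                const double *b  = &bval[4*k];
                double x0 = x[2*bind[k]], x1 = x[2*bind[k] + 1];
                y0 += b[0]*x0 + b[1]*x1;
                y1 += b[2]*x0 + b[3]*x1;
            }
            y[2*I]     = y0;
            y[2*I + 1] = y1;
        }
    }

Because each block reuses x0 and x1 across two rows and keeps the partial sums y0, y1 in registers, the inner loop performs more flops per load than the CSR loop above; the tuning system's task is to decide, per matrix and machine, for which block size r×c this trade-off (including the cost of explicit zero fill) actually pays off.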
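The performance upper bounds mentioned in the abstract charge each memory operation the access latency of the memory-hierarchy level that services it and treat arithmetic as free (ideal scheduling). The sketch below shows one way such a bound might be evaluated; the function name, parameters, and the incremental-latency accounting are illustrative assumptions, with the load and miss counts supplied by a separate model rather than measured.

    /* Sketch of a latency-based upper bound on SpMV performance (Mflop/s).
     *   flops     : floating-point operations (2*nnz for y <- y + A*x)
     *   loads     : total load operations
     *   nlevels   : number of cache levels
     *   alpha[i]  : access latency in cycles of level i; alpha[nlevels] is memory
     *   misses[i] : estimated misses at cache level i, i = 0..nlevels-1
     *   clock_hz  : processor clock rate in Hz
     * All inputs are modeling assumptions, not the dissertation's parameters. */
    double mflops_upper_bound(double flops, double loads, int nlevels,
                              const double *alpha, const double *misses,
                              double clock_hz)
    {
        /* Every load costs at least an L1 hit. */
        double cycles = loads * alpha[0];
        /* A miss at level i additionally pays the extra latency of level i+1. */
        for (int i = 0; i < nlevels; i++)
            cycles += misses[i] * (alpha[i + 1] - alpha[i]);
        return (flops / (cycles / clock_hz)) * 1e-6;  /* Mflop/s */
    }

Comparing measured Mflop/s of a tuned implementation against a bound of this kind is what yields the "up to 75% of the performance bound" figure cited in the abstract.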