Performance optimizations and bounds for sparse matrix-vector multiply

Authors:
Richard Vuduc;James W. Demmel;Katherine A. Yelick;Shoaib Kamil;Rajesh Nishtala;Benjamin Lee
Affiliations:
University of California, Berkeley, Berkeley, California (all authors)
Venue:
Proceedings of the 2002 ACM/IEEE conference on Supercomputing
Year:
2002

Abstract

We consider performance tuning, by code and data structure reorganization, of sparse matrix-vector multiply (SpM × V), one of the most important computational kernels in scientific applications. This paper addresses the fundamental questions of what limits exist on such performance tuning, and how closely tuned code approaches these limits.

Specifically, we develop upper and lower bounds on the performance (Mflop/s) of SpM × V when tuned using our previously proposed register blocking optimization. These bounds are based on the non-zero pattern in the matrix and the cost of basic memory operations, such as cache hits and misses. We evaluate our tuned implementations with respect to these bounds using hardware counter data on 4 different platforms and on a test set of 44 sparse matrices. We find that we can often get within 20% of the upper bound, particularly on a class of matrices from finite element modeling (FEM) problems; on non-FEM matrices, performance improvements of 2× are still possible. Lastly, we present a new heuristic that selects optimal or near-optimal register block sizes (the key tuning parameters) more accurately than our previous heuristic. Using the new heuristic, we show improvements in SpM × V performance (Mflop/s) by as much as 2.5× over an untuned implementation.

Collectively, our results suggest that future performance improvements, beyond those that we have already demonstrated for SpM × V, will come from two sources: (1) consideration of higher-level matrix structures (e.g., exploiting symmetry, matrix reordering, multiple register block sizes), and (2) optimizing kernels with more opportunity for data reuse (e.g., sparse matrix-multiple vector multiply, multiplication of A^T A by a vector).
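The abstract does not spell out register blocking, so the following is only an illustrative sketch, not the authors' implementation. It assumes the common reading of the term: the matrix is stored as small dense r × c blocks (block CSR), with explicit zeros padding partly empty blocks, so that one column index is read per block and r partial sums can stay in registers. All function names (`csr_spmv`, `csr_to_bcsr`, `bcsr_spmv`, `fill_ratio`) and the 4 × 4 test matrix are hypothetical, and the matrix dimensions are assumed to be multiples of the block size.

```python
def csr_spmv(row_ptr, col_idx, vals, x):
    """Reference y = A*x for A in compressed sparse row (CSR) format."""
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += vals[k] * x[col_idx[k]]
    return y


def csr_to_bcsr(n_rows, n_cols, row_ptr, col_idx, vals, r, c):
    """Convert CSR to r x c block CSR, padding blocks with explicit zeros.

    Assumption for this sketch: n_rows % r == 0 and n_cols % c == 0.
    """
    assert n_rows % r == 0 and n_cols % c == 0
    b_row_ptr, b_col_idx, b_vals = [0], [], []
    for bi in range(n_rows // r):
        blocks = {}  # block column index -> dense r*c block, row-major
        for di in range(r):
            i = bi * r + di
            for k in range(row_ptr[i], row_ptr[i + 1]):
                bj, dj = divmod(col_idx[k], c)
                block = blocks.setdefault(bj, [0.0] * (r * c))
                block[di * c + dj] += vals[k]
        for bj in sorted(blocks):
            b_col_idx.append(bj)
            b_vals.extend(blocks[bj])
        b_row_ptr.append(len(b_col_idx))
    return b_row_ptr, b_col_idx, b_vals


def bcsr_spmv(b_row_ptr, b_col_idx, b_vals, x, r, c):
    """y = A*x for A in r x c block CSR; one column index is read per block."""
    y = [0.0] * ((len(b_row_ptr) - 1) * r)
    for bi in range(len(b_row_ptr) - 1):
        acc = [0.0] * r  # stands in for r accumulators kept in registers
        for b in range(b_row_ptr[bi], b_row_ptr[bi + 1]):
            x0 = b_col_idx[b] * c   # first x entry touched by this block
            base = b * r * c        # offset of this block's values
            for di in range(r):
                for dj in range(c):
                    acc[di] += b_vals[base + di * c + dj] * x[x0 + dj]
        y[bi * r:(bi + 1) * r] = acc
    return y


def fill_ratio(b_vals, nnz):
    """Stored values (including padding zeros) per true non-zero."""
    return len(b_vals) / nnz


# Hypothetical 4 x 4 example:
# [1 2 0 0]
# [3 4 0 0]
# [0 0 5 0]
# [0 0 0 6]
row_ptr = [0, 2, 4, 5, 6]
col_idx = [0, 1, 0, 1, 2, 3]
vals = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x = [1.0, 1.0, 1.0, 1.0]

b_row_ptr, b_col_idx, b_vals = csr_to_bcsr(4, 4, row_ptr, col_idx, vals, 2, 2)
y_csr = csr_spmv(row_ptr, col_idx, vals, x)
y_bcsr = bcsr_spmv(b_row_ptr, b_col_idx, b_vals, x, 2, 2)
ratio = fill_ratio(b_vals, len(vals))
```

In this example the 2 × 2 blocking stores 8 values for 6 true non-zeros, so the blocked kernel performs extra multiplications by zero in exchange for fewer index loads. A block-size heuristic of the kind the abstract mentions has to weigh that trade-off; how the paper's heuristic does so is not described here.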