Optimization of sparse matrix-vector multiplication on emerging multicore platforms

Authors:
Samuel Williams;Leonid Oliker;Richard Vuduc;John Shalf;Katherine Yelick;James Demmel
Affiliations:
CRD/NERSC, Lawrence Berkeley National Laboratory, One Cyclotron Rd., MS:50A-1148, Berkeley, CA 94720, USA and Computer Science Division, University of California at Berkeley, Berkeley, CA 94720, U ...;CRD/NERSC, Lawrence Berkeley National Laboratory, One Cyclotron Rd., MS:50A-1148, Berkeley, CA 94720, USA;College of Computing, Georgia Institute of Technology, Atlanta, GA 30332-0765, USA;CRD/NERSC, Lawrence Berkeley National Laboratory, One Cyclotron Rd., MS:50A-1148, Berkeley, CA 94720, USA;CRD/NERSC, Lawrence Berkeley National Laboratory, One Cyclotron Rd., MS:50A-1148, Berkeley, CA 94720, USA and Computer Science Division, University of California at Berkeley, Berkeley, CA 94720, U ...;Computer Science Division, University of California at Berkeley, Berkeley, CA 94720, USA
Venue:
Parallel Computing
Year:
2009

Abstract

We are witnessing a dramatic change in computer architecture due to the multicore paradigm shift, as every electronic device from cell phones to supercomputers confronts parallelism of unprecedented scale. To fully unleash the potential of these systems, the HPC community must develop multicore-specific optimization methodologies for important scientific computations. In this work, we examine sparse matrix-vector multiply (SpMV), one of the most heavily used kernels in scientific computing, across a broad spectrum of multicore designs. Our experimental platforms include the homogeneous AMD quad-core, AMD dual-core, and Intel quad-core designs and the heterogeneous STI Cell; we also present one of the first scientific studies of the highly multithreaded Sun Victoria Falls (a Niagara2 SMP). We present several optimization strategies that are especially effective in the multicore environment, and demonstrate significant performance improvements over existing state-of-the-art serial and parallel SpMV implementations. Additionally, we present key insights into the architectural trade-offs of leading multicore design strategies in the context of demanding memory-bound numerical algorithms.
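For readers unfamiliar with the kernel, the following is a minimal, unoptimized sketch of SpMV, y = Ax. It assumes the matrix is stored in compressed sparse row (CSR) form; the abstract does not name a storage format, so CSR, the function name `spmv_csr`, and the array names are illustrative assumptions rather than the authors' implementation.

```python
def spmv_csr(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a matrix A stored in CSR form.

    row_ptr[i]..row_ptr[i+1] delimits the nonzeros of row i;
    col_idx and values hold each nonzero's column and value.
    (Hypothetical helper for illustration only.)
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        # One multiply-add per nonzero, with an indirect read of x.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y


# Small example: A = [[4, 0, 1],
#                     [0, 2, 0],
#                     [3, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [4.0, 1.0, 2.0, 3.0, 5.0]
x = [1.0, 2.0, 3.0]
y = spmv_csr(row_ptr, col_idx, values, x)
```

Each nonzero contributes only two floating-point operations but requires loading a matrix value, a column index, and an indirectly addressed element of `x`. This low ratio of arithmetic to data movement is one common reason the kernel is regarded as memory-bound.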