Optimization of sparse matrix-vector multiplication on emerging multicore platforms

Authors:
Samuel Williams; Leonid Oliker; Richard Vuduc; John Shalf; Katherine Yelick; James Demmel
Affiliations:
Lawrence Berkeley National Laboratory, Berkeley, CA and University of California at Berkeley, Berkeley, CA; Lawrence Berkeley National Laboratory, Berkeley, CA; Lawrence Livermore National Laboratory, Livermore, CA; Lawrence Berkeley National Laboratory, Berkeley, CA; Lawrence Berkeley National Laboratory, Berkeley, CA and University of California at Berkeley, Berkeley, CA; University of California at Berkeley, Berkeley, CA
Venue:
Proceedings of the 2007 ACM/IEEE conference on Supercomputing
Year:
2007

Abstract

We are witnessing a dramatic change in computer architecture due to the multicore paradigm shift, as every electronic device from cell phones to supercomputers confronts parallelism of unprecedented scale. To fully unleash the potential of these systems, the HPC community must develop multicore-specific optimization methodologies for important scientific computations. In this work, we examine sparse matrix-vector multiply (SpMV), one of the most heavily used kernels in scientific computing, across a broad spectrum of multicore designs. Our experimental platform includes the homogeneous AMD dual-core and Intel quad-core designs, the heterogeneous STI Cell, as well as the first scientific study of the highly multithreaded Sun Niagara2. We present several optimization strategies especially effective for the multicore environment, and demonstrate significant performance improvements compared to existing state-of-the-art serial and parallel SpMV implementations. Additionally, we present key insights into the architectural tradeoffs of leading multicore design strategies, in the context of demanding memory-bound numerical algorithms.