Sparsity: Optimization Framework for Sparse Matrix Kernels

Authors:
Eun-Jin Im;Katherine Yelick;Richard Vuduc
Affiliations:
School of Computer Science Kookmin University, Seoul, Korea;Computer Science Division University of California, Berkeley, CA, USA;Computer Science Division University of California, Berkeley, CA, USA
Venue:
International Journal of High Performance Computing Applications
Year:
2004

Abstract

Sparse matrix-vector multiplication is an important computational kernel that performs poorly on most modern processors because of its low compute-to-memory ratio and irregular memory access patterns. Optimization is difficult because of the complexity of cache-based memory systems and because performance depends heavily on the non-zero structure of the matrix. The SPARSITY system is designed to address these problems by allowing users to automatically build sparse matrix kernels that are tuned to their matrices and machines. SPARSITY combines traditional techniques such as loop transformations with data structure transformations and optimization heuristics that are specific to sparse matrices. It provides a novel framework for selecting optimization parameters, such as block size, using a combination of performance models and search. In this paper we discuss the optimization of two operations: a sparse matrix times a dense vector and a sparse matrix times a set of dense vectors. Our experience indicates that register-level optimizations are effective for matrices arising in certain scientific simulations, in particular finite-element problems. Cache-level optimizations are important when the vector used in multiplication is larger than the cache size, especially for matrices in which the non-zero structure is random. For applications involving multiple vectors, reorganizing the computation to perform the entire set of multiplications as a single operation produces significant speedups. We describe the different optimizations and parameter selection techniques and evaluate them on several machines using over 40 matrices taken from a broad set of application domains. Our results demonstrate speedups of up to 4X for the single-vector case and up to 10X for the multiple-vector case.
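
To make the register-level idea concrete, the following is a minimal sketch, not SPARSITY's generated code or its API. It compares a plain compressed sparse row (CSR) multiply with a blocked variant that stores fixed r x c blocks, padding each stored block with explicit zeros. All function names, the dense-list input, and the `fill_overhead` helper are assumptions made for illustration; the abstract states only that block size is chosen with performance models and search, so the overhead figure here is just one plausible quantity such a model might weigh.

```python
# Illustrative sketch only: names and data layout are assumptions,
# not the interface of the SPARSITY system.

def dense_to_csr(a):
    """Convert a dense list-of-lists matrix to CSR (values, col_idx, row_ptr)."""
    values, col_idx, row_ptr = [], [], [0]
    for row in a:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))
    return values, col_idx, row_ptr


def csr_spmv(values, col_idx, row_ptr, x):
    """y = A*x in CSR: one indirect load of x per stored non-zero."""
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y


def dense_to_bcsr(a, r, c):
    """Convert to r x c blocked CSR. Any block containing a non-zero is
    stored in full, so zeros inside it become explicit stored entries."""
    n_rows, n_cols = len(a), len(a[0])
    assert n_rows % r == 0 and n_cols % c == 0, "sketch assumes exact tiling"
    blocks, block_col, block_ptr = [], [], [0]
    for bi in range(0, n_rows, r):
        for bj in range(0, n_cols, c):
            blk = [a[bi + di][bj + dj] for di in range(r) for dj in range(c)]
            if any(v != 0 for v in blk):
                blocks.append(blk)
                block_col.append(bj)  # first column covered by the block
        block_ptr.append(len(blocks))
    return blocks, block_col, block_ptr


def bcsr_spmv(blocks, block_col, block_ptr, x, r, c):
    """y = A*x in blocked CSR: one column index per block, and r running
    sums that a compiled kernel could keep in registers."""
    y = [0.0] * ((len(block_ptr) - 1) * r)
    for bi in range(len(block_ptr) - 1):
        acc = [0.0] * r
        for k in range(block_ptr[bi], block_ptr[bi + 1]):
            j0, blk = block_col[k], blocks[k]
            for di in range(r):
                for dj in range(c):
                    acc[di] += blk[di * c + dj] * x[j0 + dj]
        y[bi * r:(bi + 1) * r] = acc
    return y


def fill_overhead(a, r, c):
    """Stored entries (including padding zeros) divided by true non-zeros."""
    blocks, _, _ = dense_to_bcsr(a, r, c)
    nnz = sum(1 for row in a for v in row if v != 0)
    return len(blocks) * r * c / nnz
```

For example, with `A = [[1, 2, 0, 0], [3, 4, 0, 0], [0, 0, 5, 0], [0, 0, 6, 7]]` and `x = [1, 2, 3, 4]`, both `csr_spmv(*dense_to_csr(A), x)` and `bcsr_spmv(*dense_to_bcsr(A, 2, 2), x, 2, 2)` produce the same product, while `fill_overhead(A, 2, 2)` shows that the 2 x 2 layout stores 8 entries for 7 true non-zeros. A larger block saves index loads but may store more padding, which is why the best block size depends on both the matrix and the machine.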