Optimizing Sparse Matrix Computations for Register Reuse in SPARSITY

Authors:
Eun-Jin Im; Katherine A. Yelick
Venue:
ICCS '01 Proceedings of the International Conference on Computational Sciences-Part I
Year:
2001

Abstract

Sparse matrix-vector multiplication is an important computational kernel that tends to perform poorly on modern processors, largely because of its high ratio of memory operations to arithmetic operations. Optimizing this algorithm is difficult, both because of the complexity of memory systems and because performance depends heavily on the nonzero structure of the matrix. The Sparsity system is designed to address these problems by allowing users to automatically build sparse matrix kernels that are tuned to their matrices and machines. The most difficult aspect of optimizing these algorithms is selecting among a large set of possible transformations and choosing parameters, such as block size. In this paper we discuss the optimization of two operations: a sparse matrix times a dense vector, and a sparse matrix times a set of dense vectors. Our experience indicates that for matrices arising in scientific simulations, register-level optimizations are critical, and we focus here on the optimizations and parameter selection techniques that Sparsity uses at the register level. We demonstrate speedups of up to 2× for the single-vector case and up to 5× for the multiple-vector case.
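One register-level optimization of this kind is register blocking: the matrix is stored as small dense r×c blocks (a block compressed sparse row, or BCSR, layout), so that one column index serves a whole block, the r partial sums of a block row can stay in registers, and each of the c source-vector entries is loaded once per block. The sketch below illustrates only the data layout and loop structure in Python. It is not Sparsity's generated code. The function names `to_bcsr` and `bcsr_spmv` are hypothetical, the input is a dense list of lists purely for brevity, and partially filled blocks are padded with explicit zeros. A compiled kernel would instead fully unroll the r×c block for fixed r and c.

```python
# Minimal sketch of register blocking (BCSR) for y += A*x.
# Assumptions (illustrative, not from the paper): dense input for brevity,
# aligned r x c blocks, explicit zero "fill" stored inside partial blocks.

def to_bcsr(dense, r, c):
    """Convert a dense matrix to BCSR with r x c blocks.

    Returns (block_ptr, block_col, block_val): block_ptr[ib]..block_ptr[ib+1]
    indexes the blocks of block row ib, block_col holds each block's first
    column, and block_val stores each block's r*c values in row-major order."""
    m, n = len(dense), len(dense[0])
    block_ptr, block_col, block_val = [0], [], []
    for bi in range(0, m, r):
        for bj in range(0, n, c):
            # Pad rows/columns that fall past the matrix edge with zeros.
            block = [dense[i][j] if i < m and j < n else 0.0
                     for i in range(bi, bi + r)
                     for j in range(bj, bj + c)]
            if any(v != 0.0 for v in block):   # keep only nonzero blocks
                block_col.append(bj)
                block_val.extend(block)
        block_ptr.append(len(block_col))
    return block_ptr, block_col, block_val


def bcsr_spmv(m, n, r, c, block_ptr, block_col, block_val, x, y):
    """y += A*x for A in BCSR form; returns y (modified in place)."""
    for ib, bi in enumerate(range(0, m, r)):
        acc = [0.0] * r                  # r partial sums ("registers")
        for k in range(block_ptr[ib], block_ptr[ib + 1]):
            bj = block_col[k]            # one index per r x c block
            # Load the c source entries once and reuse them for all r rows.
            xs = [x[bj + j] if bj + j < n else 0.0 for j in range(c)]
            base = k * r * c
            for i in range(r):
                s = acc[i]
                for j in range(c):
                    s += block_val[base + i * c + j] * xs[j]
                acc[i] = s
        for i in range(r):               # write back once per block row
            if bi + i < m:
                y[bi + i] += acc[i]
    return y
```

With r = c = 1 this reduces to ordinary CSR. Larger blocks trade fewer index loads and more register reuse against the extra multiply-adds on stored zeros, which is why the block size has to be chosen per matrix and machine.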
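For a sparse matrix times a set of k dense vectors, each matrix entry loaded from memory can serve k multiply-adds instead of one. This is one plausible reason the multiple-vector case leaves more room for register-level tuning. The Python sketch below shows that reuse on a plain CSR matrix. The name `csr_spmm` is hypothetical, and so is the assumed layout: X and Y are stored row-interleaved, so X[j] holds entry j of all k vectors. This is an illustration, not Sparsity's kernel.

```python
# Minimal sketch of Y += A*X for a CSR matrix A and k dense vectors.
# Assumption (illustrative): X and Y are lists of rows, X[j][v] being
# entry j of vector v, so the k values for one column are contiguous.

def csr_spmm(m, row_ptr, col_idx, vals, X, Y):
    """Y += A*X; returns Y (rows replaced in place)."""
    k = len(X[0]) if X else 0
    for i in range(m):
        acc = list(Y[i])                 # k accumulators for row i
        for p in range(row_ptr[i], row_ptr[i + 1]):
            a = vals[p]                  # load the matrix entry once...
            xj = X[col_idx[p]]
            for v in range(k):           # ...and reuse it for every vector
                acc[v] += a * xj[v]
        Y[i] = acc
    return Y
```

For example, with A = [[1, 0, 2], [0, 3, 0]] in CSR form (row_ptr [0, 2, 3], col_idx [0, 2, 1], vals [1, 2, 3]) and two vectors stored as X = [[1, 10], [2, 20], [3, 30]], each of the three nonzeros is read once but used twice. Register blocking as in the single-vector case can be combined with this loop.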
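Choosing the block size is the parameter-selection problem singled out above. One heuristic of this general kind, sketched here as an assumption rather than as the paper's exact procedure, combines two inputs. The first is a machine profile: the Mflop/s each r×c kernel reaches on a dense matrix stored in block-sparse form, measured once per machine. The second is the matrix's fill ratio: stored entries (including padded zeros) divided by true nonzeros. The estimated useful rate for a block size is then profile / fill. The names `fill_ratio` and `choose_block_size` are hypothetical. The fill ratio is computed exactly here, although a practical tuner might estimate it from a sample of block rows. The sketch also assumes the matrix has at least one nonzero.

```python
# Minimal sketch of block-size selection: estimated rate = profile / fill.
# Assumptions (illustrative): aligned r x c blocks padded at the edges, as
# in a BCSR layout; dense_profile is supplied by the caller from offline
# measurements on the target machine.

def fill_ratio(dense, r, c):
    """Stored entries (r*c per nonzero block) divided by true nonzeros."""
    m, n = len(dense), len(dense[0])
    nnz = sum(1 for row in dense for v in row if v != 0.0)
    blocks = 0
    for bi in range(0, m, r):
        for bj in range(0, n, c):
            if any(dense[i][j] != 0.0
                   for i in range(bi, min(bi + r, m))
                   for j in range(bj, min(bj + c, n))):
                blocks += 1
    return blocks * r * c / nnz


def choose_block_size(dense, dense_profile):
    """Return the (r, c) maximizing dense_profile[(r, c)] / fill_ratio.

    dense_profile maps (r, c) to the Mflop/s that the r x c kernel achieves
    on a dense matrix stored in block-sparse form."""
    best, best_score = None, float("-inf")
    for (r, c), mflops in dense_profile.items():
        score = mflops / fill_ratio(dense, r, c)   # discount wasted flops
        if score > best_score:
            best, best_score = (r, c), score
    return best
```

Dividing by the fill ratio discounts the multiply-adds spent on padded zeros. A larger block therefore wins only when its raw kernel speed outweighs the extra work it stores.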