Sparse matrix-vector multiplication is an important computational kernel that tends to perform poorly on modern processors, largely because of its high ratio of memory operations to arithmetic operations. Optimizing this algorithm is difficult, both because of the complexity of memory systems and because the performance is highly dependent on the nonzero structure of the matrix. The Sparsity system is designed to address these problems by allowing users to automatically build sparse matrix kernels that are tuned to their matrices and machines. The most difficult aspect of optimizing these algorithms is selecting among a large set of possible transformations and choosing parameters, such as block size. In this paper we discuss the optimization of two operations: a sparse matrix times a dense vector and a sparse matrix times a set of dense vectors. Our experience indicates that for matrices arising in scientific simulations, register-level optimizations are critical, and we focus here on the optimizations and parameter selection techniques used in Sparsity for register-level optimizations. We demonstrate speedups of up to 2× for the single vector case and 5× for the multiple vector case.
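To make the idea of a register-level optimization concrete, the following is a minimal sketch, not code produced by Sparsity. It contrasts a plain compressed sparse row (CSR) kernel with a kernel over fixed 2×2 blocks. The function names, the array layout, and the 2×2 block size are all assumptions chosen for illustration. Python is used only for readability; in a compiled language the locals `y0`, `y1`, `x0`, and `x1` would be candidates for register allocation, which is where the benefit of blocking would come from.

```python
def spmv_csr(row_ptr, col_idx, vals, x):
    """Compute y = A @ x for A in CSR form (one index per stored value)."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            # One indirect load of x per nonzero.
            acc += vals[k] * x[col_idx[k]]
        y[i] = acc
    return y


def spmv_bcsr_2x2(brow_ptr, bcol_idx, bvals, x):
    """Compute y = A @ x for A stored as 2x2 blocks (hypothetical layout).

    brow_ptr indexes block rows, bcol_idx[k] is the block column of block k,
    and bvals holds 4 values per block in row-major order. Zeros inside a
    block are stored explicitly.
    """
    n_brows = len(brow_ptr) - 1
    y = [0.0] * (2 * n_brows)
    for bi in range(n_brows):
        y0 = 0.0  # the two destination entries are accumulated in locals
        y1 = 0.0
        for k in range(brow_ptr[bi], brow_ptr[bi + 1]):
            j = 2 * bcol_idx[k]
            x0, x1 = x[j], x[j + 1]  # each source entry is loaded once, used twice
            a = 4 * k
            y0 += bvals[a] * x0 + bvals[a + 1] * x1
            y1 += bvals[a + 2] * x0 + bvals[a + 3] * x1
        y[2 * bi] = y0
        y[2 * bi + 1] = y1
    return y


# Example 4x4 matrix (illustrative data):
#   [1 2 0 0]
#   [3 4 0 0]
#   [0 0 5 0]
#   [0 0 0 6]
csr = ([0, 2, 4, 5, 6], [0, 1, 0, 1, 2, 3], [1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
# The second block stores two explicit zeros to fill out the 2x2 shape.
bcsr = ([0, 1, 2], [0, 1], [1.0, 2.0, 3.0, 4.0, 5.0, 0.0, 0.0, 6.0])

x = [1.0, 2.0, 3.0, 4.0]
y_csr = spmv_csr(*csr, x)
y_bcsr = spmv_bcsr_2x2(*bcsr, x)
```

In this example the blocked form stores one column index per block instead of one per nonzero, but it also stores and multiplies two explicit zeros. That trade-off between fewer index loads and extra fill depends on the nonzero structure of the matrix, which suggests why the choice of block size is a tuning parameter rather than a fixed constant.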