We improve the performance of sparse matrix-vector multiplication (SpMV) on modern cache-based superscalar machines when the matrix structure consists of multiple, irregularly aligned rectangular blocks. Matrices from finite element modeling applications often have this structure. We split the matrix A into a sum, A1 + A2 + ... + As, where each term is stored in a new data structure we call unaligned block compressed sparse row (UBCSR) format. The classical approach, which stores A in block compressed sparse row (BCSR) format, can also reduce execution time, but the improvement may be limited because BCSR imposes an alignment on the matrix non-zeros that leads to extra work from explicitly filled-in zeros. Combining splitting with UBCSR reduces this extra work while retaining the generally lower memory bandwidth requirements and register-level tiling opportunities of BCSR. On a set of application matrices, we show speedups of up to 2.1× over no blocking and up to 1.8× over BCSR as used in prior work. Even when performance does not improve significantly, split UBCSR usually reduces matrix storage.
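The splitting idea above can be illustrated with a minimal sketch. This uses plain CSR storage for each term rather than the paper's UBCSR format (the split y = A1·x + A2·x + ... + As·x is the same either way), and the function names are illustrative, not from the paper:

```python
def csr_spmv(n_rows, indptr, indices, data, x):
    """y = A @ x for a matrix stored in compressed sparse row (CSR) format."""
    y = [0.0] * n_rows
    for i in range(n_rows):
        # Nonzeros of row i live in data[indptr[i]:indptr[i+1]].
        for k in range(indptr[i], indptr[i + 1]):
            y[i] += data[k] * x[indices[k]]
    return y


def split_spmv(terms, x):
    """Compute y = (A1 + A2 + ... + As) @ x, one SpMV per term, accumulated.

    Each term is a tuple (n_rows, indptr, indices, data) describing one
    addend of the split; in the paper each term would instead hold blocks
    of one size/alignment in UBCSR format.
    """
    n_rows = terms[0][0]
    y = [0.0] * n_rows
    for (n, indptr, indices, data) in terms:
        yi = csr_spmv(n, indptr, indices, data, x)
        y = [a + b for a, b in zip(y, yi)]
    return y


# Toy example: split A = [[1, 2], [3, 4]] into its diagonal and
# off-diagonal parts and multiply by x = [1, 1].
A1 = (2, [0, 1, 2], [0, 1], [1.0, 4.0])  # diagonal entries
A2 = (2, [0, 1, 2], [1, 0], [2.0, 3.0])  # off-diagonal entries
print(split_spmv([A1, A2], [1.0, 1.0]))  # [3.0, 7.0], same as A @ x
```

The performance point in the abstract is that when each Ai holds only blocks of one shape and alignment, its inner loop can be unrolled over dense register tiles without padding zeros, whereas a single aligned BCSR representation of A must fill in explicit zeros wherever the true blocks straddle its imposed grid.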