In this paper we revisit the performance issues of the widely used sparse matrix-vector multiplication kernel on modern microarchitectures. Previous work reports a number of different factors that may significantly reduce performance. However, the interaction of these factors with the underlying architectural characteristics is not clearly understood, which may lead to misguided and thus unsuccessful optimization attempts. To gain insight into the details of performance, we conduct a suite of experiments on a rich set of matrices for three different commodity hardware platforms. Based on our experiments, we extract useful conclusions that can serve as guidelines for the subsequent optimization of the kernel.
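
For readers unfamiliar with the kernel under study, the following is a minimal sketch of sparse matrix-vector multiplication, y = A x. The abstract does not name a storage format, so the Compressed Sparse Row (CSR) layout used here is an assumption, and the names `spmv_csr`, `row_ptr`, `col_idx`, and `values` are illustrative rather than taken from the paper.

```python
def spmv_csr(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a matrix A stored in CSR form (assumed layout).

    row_ptr[i]..row_ptr[i + 1] delimits the nonzeros of row i,
    col_idx[k] is the column of the k-th nonzero, values[k] its value.
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            # Indirect access into x through col_idx.
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y


# Small 3x3 example matrix:
# [[4, 0, 1],
#  [0, 2, 0],
#  [3, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [4.0, 1.0, 2.0, 3.0, 5.0]
x = [1.0, 2.0, 3.0]
y = spmv_csr(row_ptr, col_idx, values, x)
```

In this layout each nonzero costs one multiply-add but also requires reading an index and an indirectly addressed element of `x`, which is one plausible place where the kernel and the memory hierarchy interact.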