The cache performance and optimizations of blocked algorithms
ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
A Shifted Block Lanczos Algorithm for Solving Sparse Symmetric Generalized Eigenproblems
SIAM Journal on Matrix Analysis and Applications
Optimizing matrix multiply using PHiPAC: a portable, high-performance, ANSI C coding methodology
ICS '97 Proceedings of the 11th international conference on Supercomputing
Improving the memory-system performance of sparse-matrix vector multiplication
IBM Journal of Research and Development
Computer architecture (2nd ed.): a quantitative approach
Computer architecture (2nd ed.): a quantitative approach
Improving performance of sparse matrix-vector multiplication
SC '99 Proceedings of the 1999 ACM/IEEE conference on Supercomputing
Templates for the solution of algebraic eigenvalue problems: a practical guide
Templates for the solution of algebraic eigenvalue problems: a practical guide
Compiling parallel code for sparse matrix applications
SC '97 Proceedings of the 1997 ACM/IEEE conference on Supercomputing
HPCN Europe '99 Proceedings of the 7th International Conference on High-Performance Computing and Networking
Ordering Unstructured Meshes for Sparse Matrix Computations on Leading Parallel Systems
IPDPS '00 Proceedings of the 15 IPDPS 2000 Workshops on Parallel and Distributed Processing
Optimizing the performance of sparse matrix-vector multiplication
Optimizing the performance of sparse matrix-vector multiplication
Motion Segmentation and Tracking Using Normalized Cuts
ICCV '98 Proceedings of the Sixth International Conference on Computer Vision
Sparse Matrix-Vector multiplication on FPGAs
Proceedings of the 2005 ACM/SIGDA 13th international symposium on Field-programmable gate arrays
Optimizing stream programs using linear state space analysis
Proceedings of the 2005 international conference on Compilers, architectures and synthesis for embedded systems
Automatic Tuning Matrix Multiplication Performance on Graphics Hardware
Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
The potential of the cell processor for scientific computing
Proceedings of the 3rd conference on Computing frontiers
Online performance auditing: using hot optimizations without getting burned
Proceedings of the 2006 ACM SIGPLAN conference on Programming language design and implementation
Program generation for the all-pairs shortest path problem
Proceedings of the 15th international conference on Parallel architectures and compilation techniques
Accelerating sparse matrix computations via data compression
Proceedings of the 20th annual international conference on Supercomputing
FFT program generation for shared memory: SMP and multicore
Proceedings of the 2006 ACM/IEEE conference on Supercomputing
An operation stacking framework for large ensemble computations
Proceedings of the 21st annual international conference on Supercomputing
Generation and optimisation of code using Coxeter lattice paths
Proceedings of the 2007 international workshop on Parallel symbolic computation
A Study of Architectural Optimization Methods in Bioinformatics Applications
International Journal of High Performance Computing Applications
Performance Optimization and Modeling of Blocked Sparse Kernels
International Journal of High Performance Computing Applications
Scientific computing Kernels on the cell processor
International Journal of Parallel Programming
Optimization of sparse matrix-vector multiplication on emerging multicore platforms
Proceedings of the 2007 ACM/IEEE conference on Supercomputing
Proceedings of the 22nd annual international conference on Supercomputing
Block-Based Approach to Solving Linear Systems
ICCS '07 Proceedings of the 7th international conference on Computational Science, Part I: ICCS 2007
How to Write Fast Numerical Code: A Small Introduction
Generative and Transformational Techniques in Software Engineering II
Pattern-based sparse matrix representation for memory-efficient SMVM kernels
Proceedings of the 23rd international conference on Supercomputing
Operator Language: A Program Generation Framework for Fast Kernels
DSL '09 Proceedings of the IFIP TC 2 Working Conference on Domain-Specific Languages
Proceedings of the twenty-first annual symposium on Parallelism in algorithms and architectures
Implementing sparse matrix-vector multiplication on throughput-oriented processors
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
Minimizing communication in sparse matrix solvers
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
Model-driven autotuning of sparse matrix-vector multiply on GPUs
Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
Parallel symmetric sparse matrix-vector product on scalar multi-core CPUs
Parallel Computing
performance/energy optimization of dsp transforms on the XScale processor
HiPEAC'07 Proceedings of the 2nd international conference on High performance embedded architectures and compilers
Operation Stacking for Ensemble Computations With Variable Convergence
International Journal of High Performance Computing Applications
International Journal of High Performance Computing Applications
From Sparse Matrix to Optimal GPU CUDA Sparse Matrix Vector Product Implementation
CCGRID '10 Proceedings of the 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing
On the limits of GPU acceleration
HotPar'10 Proceedings of the 2nd USENIX conference on Hot topics in parallelism
An input-centric paradigm for program dynamic optimizations
Proceedings of the ACM international conference on Object oriented programming systems languages and applications
Hierarchical Diagonal Blocking and Precision Reduction Applied to Combinatorial Multigrid
Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
AARTS: low overhead online adaptive auto-tuning
Proceedings of the 1st International Workshop on Adaptive Self-Tuning Computing Systems for the Exaflop Era
Proceedings of the 2011 TeraGrid Conference: Extreme Digital Discovery
Exploiting dense substructures for fast sparse matrix vector multiplication
International Journal of High Performance Computing Applications
A step towards transparent integration of input-consciousness into dynamic program optimizations
Proceedings of the 2011 ACM international conference on Object oriented programming systems languages and applications
Automatic performance programming
Proceedings of the 10th SIGPLAN symposium on New ideas, new paradigms, and reflections on programming and software
Performance evaluation of storage formats for sparse matrices in fortran
HPCC'06 Proceedings of the Second international conference on High Performance Computing and Communications
Fast sparse matrix-vector multiplication by exploiting variable block structure
HPCC'05 Proceedings of the First international conference on High Performance Computing and Communications
Journal of Parallel and Distributed Computing
A parallel algebraic multigrid solver on graphics processing units
HPCA'09 Proceedings of the Second international conference on High Performance Computing and Applications
Performance tuning of matrix triple products based on matrix structure
PARA'04 Proceedings of the 7th international conference on Applied Parallel Computing: state of the Art in Scientific Computing
Optimization of sparse matrix-vector multiplication using reordering techniques on GPUs
Microprocessors & Microsystems
Storage formats for sparse matrices in java
ICCS'05 Proceedings of the 5th international conference on Computational Science - Volume Part I
High-performance sparse matrix-vector multiplication on GPUs for structured grid computations
Proceedings of the 5th Annual Workshop on General Purpose Processing with Graphics Processing Units
Sparse matrix-vector multiply on the HICAMP architecture
Proceedings of the 26th ACM international conference on Supercomputing
clSpMV: A Cross-Platform OpenCL SpMV Framework on GPUs
Proceedings of the 26th ACM international conference on Supercomputing
SMAT: an input adaptive auto-tuner for sparse matrix-vector multiplication
Proceedings of the 34th ACM SIGPLAN conference on Programming language design and implementation
When polyhedral transformations meet SIMD code generation
Proceedings of the 34th ACM SIGPLAN conference on Programming language design and implementation
Efficient sparse matrix-vector multiplication on x86-based many-core processors
Proceedings of the 27th international ACM conference on International conference on supercomputing
Journal of Parallel and Distributed Computing
Sparse matrix-vector multiplication on the Single-Chip Cloud Computer many-core processor
Journal of Parallel and Distributed Computing
Fast iterative graph computation with block updates
Proceedings of the VLDB Endowment
An Infrastructure for Tackling Input-Sensitivity of GPU Program Optimizations
International Journal of Parallel Programming
Minimizing synchronizations in sparse iterative solvers for distributed supercomputers
Computers & Mathematics with Applications
Applications of the streamed storage format for sparse matrix operations
International Journal of High Performance Computing Applications
Amesos2 and Belos: Direct and iterative solvers for large sparse linear systems
Scientific Programming
Empirical Installation of Linear Algebra Shared-Memory Subroutines for Auto-Tuning
International Journal of Parallel Programming
Hi-index | 0.00 |
Sparse matrix-vector multiplication is an important computational kernel that performs poorly on most modern processors due to a low compute-to-memory ratio and irregular memory access patterns. Optimization is difficult because of the complexity of cache-based memory systems and because performance is highly dependent on the non-zero structure of the matrix. The SPARSITY system is designed to address these problems by allowing users to automatically build sparse matrix kernels that are tuned to their matrices and machines. SPARSITY combines traditional techniques such as loop transformations with data structure transformations and optimization heuristics that are specific to sparse matrices. It provides a novel framework for selecting optimization parameters, such as block size, using a combination of performance models and search. In this paper we discuss the optimization of two operations: a sparse matrix times a dense vector and a sparse matrix times a set of dense vectors. Our experience indicates that register level optimizations are effective for matrices arising in certain scientific simulations, in particular finite-element problems. Cache level optimizations are important when the vector used in multiplication is larger than the cache size, especially for matrices in which the non-zero structure is random. For applications involving multiple vectors, reorganizing the computation to perform the entire set of multiplications as a single operation produces significant speedups. We describe the different optimizations and parameter selection techniques and evaluate them on several machines using over 40 matrices taken from a broad set of application domains. Our results demonstrate speedups of up to 4X for the single vector case and up to 10X for the multiple vector case.