This dissertation presents an automated system for generating highly efficient, platform-adapted implementations of sparse matrix kernels. We show that conventional implementations of important sparse kernels such as sparse matrix-vector multiply (SpMV) have historically run at 10% or less of peak machine speed on cache-based superscalar architectures. Our implementations of SpMV, automatically tuned using an empirical-search methodology, can by contrast achieve up to 31% of peak machine speed, and can be up to 4× faster. Given a matrix, kernel, and machine, our approach to selecting a fast implementation consists of two steps: (1) we identify and generate a space of reasonable implementations, and (2) we search this space for the fastest one using a combination of heuristic models and actual experiments (i.e., running and timing the code). We build on the SPARSITY system for generating highly tuned implementations of the SpMV kernel y ← y + Ax, where A is a sparse matrix and x, y are dense vectors. We extend SPARSITY to support tuning for a variety of common non-zero patterns arising in practice, and for additional kernels such as sparse triangular solve (SpTS) and computation of A^T A·x (or A A^T·x) and A^ρ·x. We develop new models to compute, for particular data structures and kernels, the best absolute performance (e.g., in Mflop/s) we might expect for a given matrix and machine. These performance upper bounds account for the cost of memory operations at all levels of the memory hierarchy, but assume ideal instruction scheduling and low-level tuning. Evaluating our performance against such bounds, we find that the generated and tuned implementations of SpMV and SpTS achieve up to 75% of the performance bound. This finding places limits on the effectiveness of additional low-level tuning (e.g., better instruction selection and scheduling). (Abstract shortened by UMI.)
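For readers unfamiliar with the kernel, the y ← y + Ax computation that the system tunes can be sketched with a plain compressed sparse row (CSR) implementation; this is an illustrative baseline sketch, not code generated by the system, and the function name and layout are our own.

```python
def spmv_csr(n_rows, row_ptr, col_idx, vals, x, y):
    """Compute y += A @ x for a sparse matrix A stored in CSR form.

    CSR keeps only the non-zeros: row_ptr[i]..row_ptr[i+1] delimits row i's
    entries in the parallel arrays vals (values) and col_idx (column indices).
    Autotuned variants (e.g., register blocking) reorganize this layout to
    reduce index overhead and memory traffic, but compute the same result.
    """
    for i in range(n_rows):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += vals[k] * x[col_idx[k]]
        y[i] += acc
    return y

# Example with A = [[2, 0, 1],
#                   [0, 3, 0],
#                   [4, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
vals    = [2.0, 1.0, 3.0, 4.0, 5.0]
y = spmv_csr(3, row_ptr, col_idx, vals, [1.0, 1.0, 1.0], [0.0, 0.0, 0.0])
# y is now [3.0, 3.0, 9.0]
```

The irregular, indirect access to x through col_idx is what keeps naive CSR code far below peak machine speed, and is the main target of the data-structure tuning described above.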