We consider performance tuning, by code and data structure reorganization, of sparse matrix-vector multiply (SpM × V), one of the most important computational kernels in scientific applications. This paper addresses the fundamental questions of what limits exist on such performance tuning, and how closely tuned code approaches these limits.

Specifically, we develop upper and lower bounds on the performance (Mflop/s) of SpM × V when tuned using our previously proposed register blocking optimization. These bounds are based on the non-zero pattern in the matrix and the cost of basic memory operations, such as cache hits and misses. We evaluate our tuned implementations with respect to these bounds using hardware counter data on 4 different platforms and on a test set of 44 sparse matrices. We find that we can often get within 20% of the upper bound, particularly on a class of matrices from finite element modeling (FEM) problems; on non-FEM matrices, performance improvements of 2× are still possible. Lastly, we present a new heuristic that selects optimal or near-optimal register block sizes (the key tuning parameters) more accurately than our previous heuristic. Using the new heuristic, we show improvements in SpM × V performance (Mflop/s) by as much as 2.5× over an untuned implementation.

Collectively, our results suggest that future performance improvements, beyond those that we have already demonstrated for SpM × V, will come from two sources: (1) consideration of higher-level matrix structures (e.g., exploiting symmetry, matrix reordering, multiple register block sizes), and (2) optimizing kernels with more opportunity for data reuse (e.g., sparse matrix-multiple vector multiply, multiplication of A^T A by a vector).
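To make the idea of register blocking concrete, the following is a minimal sketch and not the implementation evaluated in the paper. It assumes a block compressed sparse row (BCSR) layout in which the matrix is stored as dense r × c blocks, so that the r destination entries can stay in registers while a block is processed. The function names `to_bcsr` and `bcsr_spmv`, the dense list-of-lists input, and the requirement that the matrix dimensions be multiples of the block size are all assumptions made for illustration. A tuned kernel would be written in a compiled language with the inner loops fully unrolled for each (r, c).

```python
def to_bcsr(dense, r, c):
    """Convert a dense list-of-lists matrix into r x c block CSR arrays.

    Any block with at least one nonzero is stored in full, so explicit
    zeros may be stored alongside the true nonzeros.
    """
    m, n = len(dense), len(dense[0])
    # Assumption of this sketch: dimensions are multiples of the block size.
    assert m % r == 0 and n % c == 0
    block_ptr, block_col, values = [0], [], []
    for bi in range(m // r):
        for bj in range(n // c):
            block = [dense[bi * r + i][bj * c + j]
                     for i in range(r) for j in range(c)]
            if any(v != 0 for v in block):
                block_col.append(bj)
                values.extend(block)  # row-major within the block
        block_ptr.append(len(block_col))
    return block_ptr, block_col, values


def bcsr_spmv(block_ptr, block_col, values, x, r, c):
    """Compute y = A * x for a matrix stored as r x c BCSR blocks."""
    num_block_rows = len(block_ptr) - 1
    y = [0.0] * (num_block_rows * r)
    for bi in range(num_block_rows):
        acc = [0.0] * r  # stands in for r registers holding entries of y
        for k in range(block_ptr[bi], block_ptr[bi + 1]):
            x0 = block_col[k] * c   # first column covered by this block
            base = k * r * c        # offset of this block in values
            for i in range(r):
                for j in range(c):
                    acc[i] += values[base + i * c + j] * x[x0 + j]
        y[bi * r:(bi + 1) * r] = acc
    return y


# Small illustrative matrix with 6 nonzeros (made up for this example).
A = [[1, 2, 0, 0],
     [3, 4, 0, 0],
     [0, 0, 5, 0],
     [0, 0, 0, 6]]
ptr, col, vals = to_bcsr(A, 2, 2)
y = bcsr_spmv(ptr, col, vals, [1, 1, 1, 1], 2, 2)
nonzeros = sum(1 for row in A for v in row if v != 0)
print(y, len(vals), nonzeros)
```

In this example the 2 × 2 blocking stores 8 values for 6 true nonzeros, because the second block is padded with two explicit zeros. That padding illustrates why the block size is a tuning parameter: larger blocks reduce index overhead and improve register reuse, but they can add stored zeros and wasted arithmetic, depending on the non-zero pattern of the matrix.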