An operation stacking framework for large ensemble computations
Proceedings of the 21st annual international conference on Supercomputing
Sparse matrix operations achieve only small fractions of peak CPU speed because of the specialized, index-based matrix representations they use, which degrade cache utilization by imposing irregular memory accesses and increasing the overall number of accesses. Compounding the problem, the small number of floating-point operations in a single sparse iteration leads to low floating-point pipeline utilization. Operation stacking addresses these problems for large ensemble computations that solve multiple systems of linear equations with identical sparsity structure. By combining the data of multiple problems and solving them as one, operation stacking improves locality, reduces cache misses, and increases floating-point pipeline utilization. Operation stacking also requires less memory bandwidth because it involves fewer index array accesses. In this paper we present the Operation Stacking Framework (OSF), an object-oriented framework that provides runtime and code generation support for the development of stacked iterative solvers. OSF's runtime component provides an iteration engine that supports efficient ejection of converged problems from the stack. It separates the specific solver algorithm from the coding conventions and data representations that are necessary to implement stacking. Stacked solvers created with OSF can be used transparently, without significant changes to existing applications. Our results show that stacking can provide speedups of up to 1.94×, with an average of 1.46×, even in scenarios in which the number of iterations required to converge varies widely within a stack of problems. Our evaluation shows that these improvements correlate with better cache utilization, improved floating-point utilization, and reduced memory accesses.
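The data-layout idea behind stacking can be illustrated with a short sketch (hypothetical code, not taken from the paper): when k matrices share one sparsity pattern, their nonzero values can be interleaved behind a single set of CSR index arrays, so each index load is amortized across k multiply-add operations. The function and parameter names below are illustrative assumptions, not OSF's actual API.

```c
#include <stddef.h>

/* Sketch of a stacked sparse matrix-vector multiply over a shared
 * CSR sparsity pattern. The row_ptr and col_idx arrays are stored
 * once; the k nonzero values for each position are interleaved, so
 * every index load is reused by all k problems in the stack. */
void stacked_spmv(size_t n_rows, size_t k,
                  const size_t *row_ptr,  /* length n_rows + 1          */
                  const size_t *col_idx,  /* length row_ptr[n_rows]     */
                  const double *vals,     /* k interleaved values/nonzero */
                  const double *x,        /* k interleaved input vectors  */
                  double *y)              /* k interleaved result vectors */
{
    for (size_t i = 0; i < n_rows; i++) {
        for (size_t s = 0; s < k; s++)
            y[i * k + s] = 0.0;
        for (size_t p = row_ptr[i]; p < row_ptr[i + 1]; p++) {
            size_t j = col_idx[p];  /* one index read serves k problems */
            for (size_t s = 0; s < k; s++)
                y[i * k + s] += vals[p * k + s] * x[j * k + s];
        }
    }
}
```

Under this kind of layout, ejecting a converged problem would amount to reducing k and compacting the interleaved value and vector arrays, while the shared row_ptr and col_idx arrays stay untouched.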