Iterative solvers for sparse problems often achieve only a small fraction of theoretical peak performance on modern architectures. The problem is challenging because sparse matrix storage schemes force irregular data accesses, which cause frequent cache misses. Furthermore, the inner loop of a typical sparse matrix operation touches only a small and variable amount of data, which leads to low utilization of floating-point registers and also prevents optimizations that improve instruction-level parallelism (ILP), such as unroll and jam. Although no general solution to this problem has been found, significant performance improvements are possible for at least one important special case: large ensemble computations, which run the same application repeatedly on different data sets. In this paper, we present the Operation Stacking Framework (OSF), which runs multiple sparse problems simultaneously, stacking their data and solving them as one, thus improving both cache and ILP utilization. Programmers can use stacked solvers transparently in their applications. Moreover, OSF provides an API that makes it simple to convert existing solvers, such as the conjugate gradient (CG) and generalized minimal residual (GMRES) methods, into stacked form. Our experimental results show that stacking reduces the number of L2 cache misses by 25% to 44%, yielding performance improvements of up to 1.95x, with an average of 1.60x, for stacked CG and GMRES algorithms on a single CPU.
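To make the idea of stacking concrete, the following is a minimal sketch of a stacked sparse matrix-vector product, the kernel at the heart of CG and GMRES. It is not the OSF API, which the abstract does not describe. It assumes that all problems in the ensemble share one sparsity pattern in compressed sparse row (CSR) form and differ only in their numerical values, and that values from the k problems are interleaved in memory. The function names (`stack`, `stacked_csr_matvec`, `csr_matvec`) and the interleaved layout are hypothetical choices made for illustration.

```python
def stack(arrays):
    """Interleave k equal-length arrays: element i of problem s lands at i*k + s.

    Hypothetical layout chosen for this sketch, not taken from OSF.
    """
    k, n = len(arrays), len(arrays[0])
    return [arrays[s][i] for i in range(n) for s in range(k)]


def stacked_csr_matvec(row_ptr, col_idx, stacked_vals, stacked_x, k):
    """Compute y_s = A_s * x_s for k problems that share one CSR pattern.

    Assumption: every A_s has the same row_ptr and col_idx; only the
    values differ. Inputs and output use the interleaved layout of stack().
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * (n_rows * k)
    for i in range(n_rows):
        acc = [0.0] * k  # one accumulator per stacked problem
        for j in range(row_ptr[i], row_ptr[i + 1]):
            c = col_idx[j]  # irregular index is loaded once for all k problems
            # Fixed trip count k with contiguous operands: this inner loop
            # is the part a compiler could unroll in a compiled language.
            for s in range(k):
                acc[s] += stacked_vals[j * k + s] * stacked_x[c * k + s]
        y[i * k:(i + 1) * k] = acc
    return y


def csr_matvec(row_ptr, col_idx, vals, x):
    """Plain single-problem CSR product, used only as a reference."""
    return [
        sum(vals[j] * x[col_idx[j]] for j in range(row_ptr[i], row_ptr[i + 1]))
        for i in range(len(row_ptr) - 1)
    ]


# Two 3x3 problems with the same nonzero pattern but different values.
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
vals_a, x_a = [1.0, 2.0, 3.0, 4.0, 5.0], [1.0, 1.0, 1.0]
vals_b, x_b = [2.0, 0.0, 1.0, 1.0, 1.0], [1.0, 2.0, 3.0]

y_stacked = stacked_csr_matvec(
    row_ptr, col_idx, stack([vals_a, vals_b]), stack([x_a, x_b]), k=2
)
y_a = csr_matvec(row_ptr, col_idx, vals_a, x_a)
y_b = csr_matvec(row_ptr, col_idx, vals_b, x_b)
```

In this layout each column index is read once and reused by all k problems, and the k operands it selects sit next to each other in memory, which is one plausible way stacking could reduce cache misses and expose a regular inner loop. Pure Python only illustrates the data layout; any performance benefit would appear in a compiled implementation.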