An operation stacking framework for large ensemble computations

Authors:
Mehmet Belgin;Calvin J. Ribbens;Godmar Back
Affiliations:
State University, Blacksburg, VA;State University, Blacksburg, VA;State University, Blacksburg, VA
Venue:
Proceedings of the 21st annual international conference on Supercomputing
Year:
2007

Abstract

Iterative solutions of sparse problems often achieve only a small fraction of the theoretical peak performance of modern architectures. The problem is challenging because sparse matrix storage schemes require data to be accessed irregularly, which causes frequent cache misses. Furthermore, the inner loop of a typical sparse matrix operation accesses only a small and variable amount of data, which not only leads to low utilization of floating-point registers but also prevents optimization techniques that improve instruction-level parallelism (ILP), such as unroll-and-jam. Although a general solution to this problem has not been found, significant performance improvements can be made for at least one important special case: large ensemble computations, which run the same application repeatedly on different data sets. In this paper, we present the Operation Stacking Framework (OSF), which runs multiple sparse problems simultaneously, stacking their data and solving them as one, thus improving both cache and ILP utilization. Programmers can use stacked solvers transparently in their applications. Moreover, OSF provides an API that makes it simple to convert existing solvers, such as the conjugate gradient (CG) and generalized minimal residual (GMRES) methods, into stacked form. Our experimental results show that stacking reduces the number of L2 cache misses by 25% to 44%, yielding speedups of up to 1.95x, with an average of 1.60x, for stacked CG and GMRES on a single CPU.
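The abstract does not spell out OSF's data layout or API, so the following is only a minimal sketch of the stacking idea, not the authors' implementation. It assumes that all S problems in the ensemble share one sparsity pattern in compressed sparse row (CSR) form and differ only in their numerical values, and that values and vectors are interleaved so that entry `k` of problem `s` sits at position `k * S + s`. The function names (`csr_spmv`, `stack`, `stacked_csr_spmv`) are hypothetical and chosen for illustration. Pure Python shows the access pattern only; any performance benefit would come from a compiled implementation of the same loop structure.

```python
def csr_spmv(row_ptr, col_idx, vals, x):
    """Reference y = A @ x for a single matrix stored in CSR form."""
    n = len(row_ptr) - 1
    y = [0.0] * n
    for i in range(n):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += vals[k] * x[col_idx[k]]
    return y


def stack(per_problem):
    """Interleave S equal-length lists: out[k * S + s] = per_problem[s][k].

    Assumed layout (not taken from the paper): the S values that belong to
    the same position are stored next to each other.
    """
    S = len(per_problem)
    length = len(per_problem[0])
    out = [0.0] * (length * S)
    for s, data in enumerate(per_problem):
        for k in range(length):
            out[k * S + s] = data[k]
    return out


def stacked_csr_spmv(row_ptr, col_idx, stacked_vals, stacked_x, S):
    """Multiply S matrices that share one sparsity pattern by S vectors.

    Returns the interleaved result: y[i * S + s] is row i of problem s.
    """
    n = len(row_ptr) - 1
    y = [0.0] * (n * S)
    for i in range(n):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            j = col_idx[k]  # column index read once, reused by all S problems
            # Fixed-length inner loop over contiguous data: in a compiled
            # language this loop is the natural candidate for unrolling.
            for s in range(S):
                y[i * S + s] += stacked_vals[k * S + s] * stacked_x[j * S + s]
    return y
```

In this layout the irregular index lookup `col_idx[k]` is performed once per nonzero rather than once per nonzero per problem, and the innermost loop has a fixed trip count S over adjacent memory locations, which is the kind of structure the abstract associates with better cache and ILP utilization. Calling `stacked_csr_spmv` on the output of `stack` should give the same numbers as calling `csr_spmv` on each problem separately, only interleaved.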