Operation Stacking for Ensemble Computations With Variable Convergence

  • Authors:
Mehmet Belgin; Godmar Back; Calvin J. Ribbens

  • Affiliations:
Department of Computer Science, Virginia Tech, 2202 Kraft Drive, Blacksburg, VA 24060, USA (all authors)

  • Venue:
  • International Journal of High Performance Computing Applications
  • Year:
  • 2010

Abstract

Sparse matrix operations achieve only small fractions of peak CPU speeds because of the use of specialized, index-based matrix representations, which degrade cache utilization by imposing irregular memory accesses and increasing the number of overall accesses. Compounding the problem, the small number of floating-point operations in a single sparse iteration leads to low floating-point pipeline utilization. Operation stacking addresses these problems for large ensemble computations that solve multiple systems of linear equations with identical sparsity structure. By combining the data of multiple problems and solving them as one, operation stacking improves locality, reduces cache misses, and increases floating-point pipeline utilization. Operation stacking also requires less memory bandwidth because it involves fewer index array accesses. In this paper we present the Operation Stacking Framework (OSF), an object-oriented framework that provides runtime and code generation support for the development of stacked iterative solvers. OSF's runtime component provides an iteration engine that supports efficient ejection of converged problems from the stack. It separates the specific solver algorithm from the coding conventions and data representations that are necessary to implement stacking. Stacked solvers created with OSF can be used transparently without requiring significant changes to existing applications. Our results show that stacking can provide speedups up to 1.94× with an average of 1.46×, even in scenarios in which the number of iterations required to converge varies widely within a stack of problems. Our evaluation shows that these improvements correlate with better cache utilization, improved floating-point utilization, and reduced memory accesses.