Automatic blocking of QR and LU factorizations for locality

Authors:
Qing Yi;Ken Kennedy;Haihang You;Keith Seymour;Jack Dongarra
Affiliations:
Lawrence Livermore National Laboratory, Livermore, CA;Rice University, Houston, TX;University of Tennessee, Knoxville, TN;University of Tennessee, Knoxville, TN;University of Tennessee, Knoxville, TN
Venue:
MSP '04 Proceedings of the 2004 workshop on Memory system performance
Year:
2004

Citing 21
Cited 5

Strategies for cache and local memory management by global program transformation

Journal of Parallel and Distributed Computing - Special Issue on Languages, Compilers and environments for Parallel Programming
The cache performance and optimizations of blocked algorithms

ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
Uniform techniques for loop optimization

ICS '91 Proceedings of the 5th international conference on Supercomputing
A data locality optimizing algorithm

PLDI '91 Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation
Compiler blockability of numerical algorithms

Proceedings of the 1992 ACM/IEEE conference on Supercomputing
Improving the ratio of memory operations to floating-point operations in loops

ACM Transactions on Programming Languages and Systems (TOPLAS)
Tile size selection using cache organization and data layout

PLDI '95 Proceedings of the ACM SIGPLAN 1995 conference on Programming language design and implementation
The Omega Library interface guide

The Omega Library interface guide
Improving data locality with loop transformations

ACM Transactions on Programming Languages and Systems (TOPLAS)
Data-centric multi-level blocking

Proceedings of the ACM SIGPLAN 1997 conference on Programming language design and implementation
Compiler blockability of dense matrix factorizations

ACM Transactions on Mathematical Software (TOMS)
Data transformations for eliminating conflict misses

PLDI '98 Proceedings of the ACM SIGPLAN 1998 conference on Programming language design and implementation
An affine partitioning algorithm to maximize parallelism and minimize communication

ICS '99 Proceedings of the 13th international conference on Supercomputing
LAPACK Users' guide (third ed.)

LAPACK Users' guide (third ed.)
Fast greedy weighted fusion

Proceedings of the 14th international conference on Supercomputing
Synthesizing transformations for locality enhancement of imperfectly-nested loop nests

Proceedings of the 14th international conference on Supercomputing
Optimizing compilers for modern architectures: a dependence-based approach

Optimizing compilers for modern architectures: a dependence-based approach
Optimizing Supercompilers for Supercomputers

Optimizing Supercompilers for Supercomputers
Hierarchical tiling for improved superscalar performance

IPPS '95 Proceedings of the 9th International Symposium on Parallel Processing
Quantifying the Multi-level Nature of Tiling Interactions

LCPC '97 Proceedings of the 10th International Workshop on Languages and Compilers for Parallel Computing
Transforming Complex Loop Nests for Locality

The Journal of Supercomputing

Self-adapting numerical software (SANS) effort

IBM Journal of Research and Development
An FPGA-based computation model for blocked algorithms

AIC'06 Proceedings of the 6th WSEAS International Conference on Applied Informatics and Communications
LU decomposition on cell broadband engine: an empirical study to exploit heterogeneous chip multiprocessors

NPC'10 Proceedings of the 2010 IFIP international conference on Network and parallel computing
Applying data copy to improve memory performance of general array computations

LCPC'05 Proceedings of the 18th international conference on Languages and Compilers for Parallel Computing
High-performance bidiagonal reduction using tile algorithms on homogeneous multicore architectures

ACM Transactions on Mathematical Software (TOMS)

Quantified Score

Hi-index	0.00

Visualization

Abstract

QR and LU factorizations for dense matrices are important linear algebra computations that are widely used in scientific applications. To efficiently perform these computations on modern computers, the factorization algorithms need to be blocked when operating on large matrices to effectively exploit the deep cache hierarchy prevalent in today's computer memory systems. Because both QR (based on Householder transformations) and LU factorization algorithms contain complex loop structures, few compilers can fully automate the blocking of these algorithms. Though linear algebra libraries such as LAPACK provides manually blocked implementations of these algorithms, by automatically generating blocked versions of the computations, more benefit can be gained such as automatic adaptation of different blocking strategies. This paper demonstrates how to apply an aggressive loop transformation technique, dependence hoisting, to produce efficient blockings for both QR and LU with partial pivoting. We present different blocking strategies that can be generated by our optimizer and compare the performance of auto-blocked versions with manually tuned versions in LAPACK, both using reference BLAS, ATLAS BLAS and native BLAS specially tuned for the underlying machine architectures.