Locality of Reference in LU Decomposition with Partial Pivoting

Authors: Sivan Toledo
Venue: SIAM Journal on Matrix Analysis and Applications
Year: 1997

Abstract

This paper presents a new partitioned algorithm for LU decomposition with partial pivoting. The new algorithm, called the recursively partitioned algorithm, is based on a recursive partitioning of the matrix. The paper analyzes the locality of reference in the new algorithm and the locality of reference in a known and widely used partitioned algorithm for LU decomposition called the right-looking algorithm. The analysis reveals that the new algorithm performs a factor of $\Theta(\sqrt{M/n})$ fewer I/O operations (or cache misses) than the right-looking algorithm, where $n$ is the order of the matrix and $M$ is the size of primary memory. The analysis also determines the optimal block size for the right-looking algorithm. Experimental comparisons between the new algorithm and the right-looking algorithm show that an implementation of the new algorithm outperforms a similarly coded right-looking algorithm on six different RISC architectures, that the new algorithm performs fewer cache misses than any other algorithm tested, and that it benefits more from Strassen's matrix-multiplication algorithm.
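
The recursive partitioning described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes a dense square or tall matrix stored as a list of lists, the names `recursive_lu`, `_factor`, and `_swap_rows` are hypothetical, and plain Python loops stand in for the triangular solve and matrix multiplication that a tuned code would delegate to optimized kernels. The sketch splits the columns in half, factors the left half recursively, applies its row exchanges to the right half, updates the right half, factors it recursively, and finally applies the later row exchanges back to the left half.

```python
def _swap_rows(a, piv, rows, c_lo, c_hi):
    """Apply the row exchanges piv[k], k in rows, to columns [c_lo, c_hi)."""
    for k in rows:
        p = piv[k]
        if p != k:
            for j in range(c_lo, c_hi):
                a[k][j], a[p][j] = a[p][j], a[k][j]


def _factor(a, piv, lo, hi):
    """Factor columns [lo, hi) of the panel made of rows lo..m-1, in place."""
    m = len(a)
    if hi - lo == 1:
        # Base case: one column. Choose the largest entry as the pivot.
        p = max(range(lo, m), key=lambda i: abs(a[i][lo]))
        if a[p][lo] == 0.0:
            raise ValueError("matrix is singular")
        piv[lo] = p
        a[lo][lo], a[p][lo] = a[p][lo], a[lo][lo]
        for i in range(lo + 1, m):
            a[i][lo] /= a[lo][lo]  # multipliers, i.e. a column of L
        return

    mid = (lo + hi) // 2
    _factor(a, piv, lo, mid)                     # factor the left half
    _swap_rows(a, piv, range(lo, mid), mid, hi)  # pivot the right half

    # A12 <- L11^{-1} A12: forward substitution with unit lower-triangular L11.
    for i in range(lo, mid):
        for k in range(lo, i):
            l_ik = a[i][k]
            for j in range(mid, hi):
                a[i][j] -= l_ik * a[k][j]

    # A22 <- A22 - L21 * A12: the matrix-multiplication update.
    for i in range(mid, m):
        for k in range(lo, mid):
            l_ik = a[i][k]
            for j in range(mid, hi):
                a[i][j] -= l_ik * a[k][j]

    _factor(a, piv, mid, hi)                     # factor the right half
    _swap_rows(a, piv, range(mid, hi), lo, mid)  # pivot the left half


def recursive_lu(matrix):
    """Return (lu, piv) such that applying piv to the rows of matrix gives L*U.

    lu holds U on and above the diagonal and the multipliers of the unit
    lower-triangular L below it; piv[k] is the row exchanged with row k.
    """
    a = [[float(x) for x in row] for row in matrix]
    m, n = len(a), len(a[0])
    if m < n:
        raise ValueError("this sketch assumes at least as many rows as columns")
    piv = list(range(n))
    _factor(a, piv, 0, n)
    return a, piv
```

For example, `recursive_lu([[2, 1, 1], [4, 3, 3], [8, 7, 9]])` returns the packed factors together with the list of row exchanges. In this form most of the arithmetic sits in the update of the trailing block, which is the step where a matrix-multiplication kernel, including a Strassen-based one, would be substituted.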