This paper presents a new partitioned algorithm for LU decomposition with partial pivoting. The new algorithm, called the recursively partitioned algorithm, is based on a recursive partitioning of the matrix. The paper analyzes the locality of reference in the new algorithm and the locality of reference in a known and widely used partitioned algorithm for LU decomposition called the right-looking algorithm. The analysis reveals that the new algorithm performs a factor of $\Theta(\sqrt{M/n})$ fewer I/O operations (or cache misses) than the right-looking algorithm, where $n$ is the order of the matrix and $M$ is the size of primary memory. The analysis also determines the optimal block size for the right-looking algorithm. Experimental comparisons between the new algorithm and the right-looking algorithm show that an implementation of the new algorithm outperforms a similarly coded right-looking algorithm on six different RISC architectures, that the new algorithm performs fewer cache misses than any other algorithm tested, and that it benefits more from Strassen's matrix-multiplication algorithm.
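The recursive partitioning described above can be illustrated with a minimal sketch. It is not the paper's implementation: the function name `recursive_lu`, the list-of-rows matrix representation, and the choice to swap whole rows at the base case (rather than applying the panel's permutation to the remaining columns afterwards) are assumptions made to keep the example short. The sketch splits the columns of a panel in half, factors the left half recursively, performs a triangular solve and a Schur-complement update on the right half, and then factors the updated right half recursively.

```python
def recursive_lu(a):
    """Factor an m-by-n matrix (m >= n, list of rows) in place as P*A = L*U.

    On return, the strict lower triangle of `a` holds L (unit diagonal implied)
    and the upper triangle holds U. The returned list `perm` satisfies:
    row i of the permuted matrix is row perm[i] of the original.
    """
    m, n = len(a), len(a[0])
    perm = list(range(m))

    def factor(c0, c1):
        # Factor the panel formed by rows c0..m-1 and columns c0..c1-1.
        if c1 - c0 == 1:
            # Base case: one column. Pick the entry of largest magnitude as pivot.
            p = max(range(c0, m), key=lambda i: abs(a[i][c0]))
            if a[p][c0] == 0:
                raise ZeroDivisionError("zero pivot: matrix is singular")
            if p != c0:
                # Simplification (assumption): swap entire rows immediately.
                a[c0], a[p] = a[p], a[c0]
                perm[c0], perm[p] = perm[p], perm[c0]
            pivot = a[c0][c0]
            for i in range(c0 + 1, m):
                a[i][c0] /= pivot
            return

        mid = (c0 + c1) // 2
        factor(c0, mid)  # recursively factor the left half of the panel

        # Triangular solve: overwrite A12 with U12 = L11^{-1} * A12.
        for i in range(c0, mid):
            for k in range(c0, i):
                lik = a[i][k]
                for j in range(mid, c1):
                    a[i][j] -= lik * a[k][j]

        # Schur-complement update: A22 <- A22 - L21 * U12.
        for i in range(mid, m):
            for k in range(c0, mid):
                lik = a[i][k]
                for j in range(mid, c1):
                    a[i][j] -= lik * a[k][j]

        factor(mid, c1)  # recursively factor the updated right half

    factor(0, n)
    return perm
```

In this sketch the update loops are written as plain triple loops; in a tuned implementation they would presumably be calls to a matrix-multiplication kernel, which is where most of the arithmetic ends up and where a fast multiplication algorithm such as Strassen's could be substituted.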