Recursion leads to automatic variable blocking for dense linear-algebra algorithms

Authors:
F. G. Gustavson
Venue:
IBM Journal of Research and Development
Year:
1997

Citing 3
Cited 83

Exploiting functional parallelism of POWER2 to design high-performance numerical algorithms

IBM Journal of Research and Development
Matrix computations (3rd ed.)

Matrix computations (3rd ed.)
Solving Linear Systems on Vector and Shared Memory Computers

Solving Linear Systems on Vector and Shared Memory Computers

Nonlinear array layouts for hierarchical memory systems

ICS '99 Proceedings of the 13th international conference on Supercomputing
Recursive array layouts and fast parallel matrix multiplication

Proceedings of the eleventh annual ACM symposium on Parallel algorithms and architectures
AJaPACK: experiments in performance portable parallel Java numerical libraries

Proceedings of the ACM 2000 conference on Java Grande
Design and evaluation of a linear algebra package for Java

Proceedings of the ACM 2000 conference on Java Grande
Transforming loops to recursion for multi-level memory hierarchies

PLDI '00 Proceedings of the ACM SIGPLAN 2000 conference on Programming language design and implementation
Symbolic bounds analysis of pointers, array indices, and accessed memory regions

PLDI '00 Proceedings of the ACM SIGPLAN 2000 conference on Programming language design and implementation
A comparison of three approaches to language, compiler, and library support for multidimensional arrays in Java

Proceedings of the 2001 joint ACM-ISCOPE conference on Java Grande
Language support for Morton-order matrices

PPoPP '01 Proceedings of the eighth ACM SIGPLAN symposium on Principles and practices of parallel programming
A recursive formulation of Cholesky factorization of a matrix in packed storage

ACM Transactions on Mathematical Software (TOMS)
The NINJA project

Communications of the ACM
FLAME: Formal Linear Algebra Methods Environment

ACM Transactions on Mathematical Software (TOMS)
Synthesizing Transformations for Locality Enhancement of Imperfectly-Nested Loop Nests

International Journal of Parallel Programming
Recursive blocked algorithms for solving triangular systems—Part I: one-sided and coupled Sylvester-type matrix equations

ACM Transactions on Mathematical Software (TOMS)
Automatic Parallelization of Recursive Procedures

International Journal of Parallel Programming
Recursive Array Layouts and Fast Matrix Multiplication

IEEE Transactions on Parallel and Distributed Systems
A Family of High-Performance Matrix Multiplication Algorithms

ICCS '01 Proceedings of the International Conference on Computational Sciences-Part I
Parallel and Fully Recursive Multifrontal Supernodal Sparse Cholesky

ICCS '02 Proceedings of the International Conference on Computational Science-Part II
LAWRA Workshop: Linear Algebra with Recursive Algorithms: http://lawra.uni-c.dk/lawra/

HPCN Europe 2000 Proceedings of the 8th International Conference on High-Performance Computing and Networking
High Performance Numerical Computing in Java: Language and Compiler Issues

LCPC '99 Proceedings of the 12th International Workshop on Languages and Compilers for Parallel Computing
Recursion Unrolling for Divide and Conquer Programs

LCPC '00 Proceedings of the 13th International Workshop on Languages and Compilers for Parallel Computing-Revised Papers
A Matlab Just-In-time Compiler

LCPC '00 Proceedings of the 13th International Workshop on Languages and Compilers for Parallel Computing-Revised Papers
Parallel Triangular Sylvester-Type Matrix Equation Solvers for SMP Systems Using Recursive Blocking

PARA '00 Proceedings of the 5th International Workshop on Applied Parallel Computing, New Paradigms for HPC in Industry and Academia
LAWRA: Linear Algebra with Recursive Algorithms

PARA '00 Proceedings of the 5th International Workshop on Applied Parallel Computing, New Paradigms for HPC in Industry and Academia
High Performance Cholesky Factorization via Blocking and Recursion That Uses Minimal Storage

PARA '00 Proceedings of the 5th International Workshop on Applied Parallel Computing, New Paradigms for HPC in Industry and Academia
High-Performance Library Software for QR Factorization

PARA '00 Proceedings of the 5th International Workshop on Applied Parallel Computing, New Paradigms for HPC in Industry and Academia
A Fast Minimal Storage Symmetric Indefinite Solver

PARA '00 Proceedings of the 5th International Workshop on Applied Parallel Computing, New Paradigms for HPC in Industry and Academia
Parallel Two-Sided Sylvester-Type Matrix Equation Solvers for SMP Systems Using Recursive Blocking

PARA '02 Proceedings of the 6th International Conference on Applied Parallel Computing Advanced Scientific Computing
A Recursive Formulation of the Inversion of Symmetric Positive Definite Matrices in Packed Storage Data Format

PARA '02 Proceedings of the 6th International Conference on Applied Parallel Computing Advanced Scientific Computing
Code Generators for Automatic Tuning of Numerical Kernels: Experiences with FFTW

SAIG '00 Proceedings of the International Workshop on Semantics, Applications, and Implementation of Program Generation
New Generalized Data Structures for Matrices Lead to a Variety of High Performance Algorithms

PPAM '01 Proceedings of the th International Conference on Parallel Processing and Applied Mathematics-Revised Papers
Automatic Generation of Block-Recursive Codes

Euro-Par '00 Proceedings from the 6th International Euro-Par Conference on Parallel Processing
Experience with a Recursive Perturbation Based Algorithm for Symmetric Indefinite Linear Systems

Euro-Par '99 Proceedings of the 5th International Euro-Par Conference on Parallel Processing
Fractal Matrix Multiplication: A Case Study on Portability of Cache Performance

WAE '01 Proceedings of the 5th International Workshop on Algorithm Engineering
Design-Driven Compilation

CC '01 Proceedings of the 10th International Conference on Compiler Construction
Recursive Version of LU Decomposition

NAA '00 Revised Papers from the Second International Conference on Numerical Analysis and Its Applications
Self-adapting software for numerical linear algebra and LAPACK for clusters

Parallel Computing - Special issue: Parallel and distributed scientific and engineering computing
Parallel and fully recursive multifrontal sparse Cholesky

Future Generation Computer Systems - Special issue: Selected numerical algorithms
Java programming for high-performance numerical computing

IBM Systems Journal
High-performance linear algebra algorithms using new generalized data structures for matrices

IBM Journal of Research and Development
Mathematical sciences in the nineties

IBM Journal of Research and Development
Symbolic bounds analysis of pointers, array indices, and accessed memory regions

ACM Transactions on Programming Languages and Systems (TOPLAS)
A fully portable high performance minimal storage hybrid format Cholesky algorithm

ACM Transactions on Mathematical Software (TOMS)
Sequoia: programming the memory hierarchy

Proceedings of the 2006 ACM/IEEE conference on Supercomputing
Recursive approach in sparse matrix LU factorization

Scientific Programming
NINJA: Java for high performance numerical computing

Scientific Programming
An experimental comparison of cache-oblivious and cache-conscious programs

Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures
Multi-threading and one-sided communication in parallel LU factorization

Proceedings of the 2007 ACM/IEEE conference on Supercomputing
Parallel matrix multiplication based on space-filling curves on shared memory multicore platforms

Proceedings of the 2008 workshop on Memory access on future processors: a solved problem?
Communication avoiding Gaussian elimination

Proceedings of the 2008 ACM/IEEE conference on Supercomputing
A class of parallel tiled linear algebra algorithms for multicore architectures

Parallel Computing
QR factorization for the Cell Broadband Engine

Scientific Programming - High Performance Computing with the Cell Broadband Engine
Towards many-core implementation of LU decomposition using Peano Curves

Proceedings of the combined workshops on UnConventional high performance computing workshop plus memory access workshop
Mapping the LU decomposition on a many-core architecture: challenges and solutions

Proceedings of the 6th ACM conference on Computing frontiers
Generalized matrix inversion is not harder than matrix multiplication

Journal of Computational and Applied Mathematics
Applying recursion to serial and parallel QR factorization leads to better performance

IBM Journal of Research and Development
Minimal-storage high-performance Cholesky factorization via blocking and recursion

IBM Journal of Research and Development
Scaling LAPACK panel operations using parallel cache assignment

Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
Cache oblivious matrix operations using Peano curves

PARA'06 Proceedings of the 8th international conference on Applied parallel computing: state of the art in scientific computing
Minimal data copy for dense linear algebra factorization

PARA'06 Proceedings of the 8th international conference on Applied parallel computing: state of the art in scientific computing
Using non-canonical array layouts in dense matrix operations

PARA'06 Proceedings of the 8th international conference on Applied parallel computing: state of the art in scientific computing
Is cache-oblivious DGEMM viable?

PARA'06 Proceedings of the 8th international conference on Applied parallel computing: state of the art in scientific computing
The relevance of new data structure approaches for dense linear algebra in the new multi-core/many core environments

PPAM'07 Proceedings of the 7th international conference on Parallel processing and applied mathematics
Hardware-oriented implementation of cache oblivious matrix operations based on space-filling curves

PPAM'07 Proceedings of the 7th international conference on Parallel processing and applied mathematics
New data structures for matrices and specialized inner kernels: low overhead for high performance

PPAM'07 Proceedings of the 7th international conference on Parallel processing and applied mathematics
Solving path problems on the GPU

Parallel Computing
Algorithm engineering: bridging the gap between algorithm theory and practice

Algorithm engineering: bridging the gap between algorithm theory and practice
Communication-optimal Parallel and Sequential Cholesky Decomposition

SIAM Journal on Scientific Computing
Optimizing matrix multiplication with a classifier learning system

LCPC'05 Proceedings of the 18th international conference on Languages and Compilers for Parallel Computing
A cache oblivious algorithm for matrix multiplication based on Peano's space filling curve

PPAM'05 Proceedings of the 6th international conference on Parallel Processing and Applied Mathematics
New generalized data structures for matrices lead to a variety of high performance dense linear algebra algorithms

PARA'04 Proceedings of the 7th international conference on Applied Parallel Computing: state of the Art in Scientific Computing
Management of deep memory hierarchies: recursive blocked algorithms and hybrid data structures for dense matrix computations

PARA'04 Proceedings of the 7th international conference on Applied Parallel Computing: state of the Art in Scientific Computing
A matrix-type for performance–portability

PARA'04 Proceedings of the 7th international conference on Applied Parallel Computing: state of the Art in Scientific Computing
Cache blocking

PARA'10 Proceedings of the 10th international conference on Applied Parallel and Scientific Computing - Volume Part I
Cache-Oblivious algorithms and matrix formats for computations on interval matrices

PARA'10 Proceedings of the 10th international conference on Applied Parallel and Scientific Computing - Volume 2
CALU: A Communication Optimal LU Factorization Algorithm

SIAM Journal on Matrix Analysis and Applications
New level-3 BLAS kernels for Cholesky factorization

PPAM'11 Proceedings of the 9th international conference on Parallel Processing and Applied Mathematics - Volume Part I
Cache blocking for linear algebra algorithms

PPAM'11 Proceedings of the 9th international conference on Parallel Processing and Applied Mathematics - Volume Part I
Graph expansion and communication costs of fast matrix multiplication

Journal of the ACM (JACM)
Level-3 Cholesky Factorization Routines Improve Performance of Many Cholesky Algorithms

ACM Transactions on Mathematical Software (TOMS)
Communication efficient Gaussian elimination with partial pivoting using a shape morphing data layout

Proceedings of the twenty-fifth annual ACM symposium on Parallelism in algorithms and architectures
Scaling LAPACK panel operations using parallel cache assignment

ACM Transactions on Mathematical Software (TOMS)
A multicore solution to Block-Toeplitz linear systems of equations

The Journal of Supercomputing

Quantified Score

Hi-index	0.02
