We present a novel way to produce dense linear algebra factorization algorithms. The current state-of-the-art (SOA) dense linear algebra algorithms have a performance inefficiency and hence give suboptimal performance for most LAPACK factorizations. We show that the main source of this inefficiency is the use of standard Fortran and C two-dimensional arrays; for the other standard format (packed one-dimensional arrays for symmetric and/or triangular matrices), the situation is much worse.

We show how to correct these inefficiencies by using new data structures (NDS) together with so-called kernel routines. The NDS generalize the current storage layouts of both standard formats. Using the concepts of equivalence and elementary matrices together with coordinate (linear) transformations, we prove that our method works for an entire class of dense linear algebra algorithms, and we use the Algorithms and Architecture approach to explain why our new method gives higher efficiency. The simplest forms of the new factorization algorithms are a direct generalization of the commonly used LINPACK algorithms; on IBM platforms they can be generated from simple, textbook-style codes by the XLF Fortran compiler. On the IBM POWER3 processor, our implementation of Cholesky factorization achieves 92% of peak performance, whereas the conventional SOA full-format LAPACK DPOTRF achieves 77%.

All programming for our NDS can be accomplished in standard Fortran through the use of three- and four-dimensional arrays; no new compiler support is necessary. Finally, we describe block hybrid formats (BHF), which require no additional storage beyond conventional (full and packed) matrix storage. New algorithms based on BHF can therefore serve as backward-compatible replacements for LAPACK or LINPACK algorithms.
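To make the four-dimensional-array idea concrete, the following is a minimal sketch, not code from the paper: the program name sb_demo, the block size NB, and the simplifying assumption that N is an exact multiple of NB are ours. It rearranges a column-major full-format matrix into square blocks held in a standard Fortran 4-D array, of the kind a kernel routine could then operate on:

    ! Sketch: copy a column-major full-format matrix A(N,N) into a
    ! square-block layout held in a standard Fortran 4-D array.
    ! Illustrative only; assumes N is an exact multiple of NB.
    program sb_demo
      implicit none
      integer, parameter :: n = 8, nb = 4, nblk = n / nb
      double precision :: a(n, n)
      double precision :: ab(nb, nb, nblk, nblk)
      integer :: i, j, ib, jb

      ! Fill A with arbitrary data.
      do j = 1, n
         do i = 1, n
            a(i, j) = dble(i + n * (j - 1))
         end do
      end do

      ! Copy each NB x NB block of A into block (ib, jb) of AB.
      do jb = 1, nblk
         do ib = 1, nblk
            do j = 1, nb
               do i = 1, nb
                  ab(i, j, ib, jb) = a((ib - 1) * nb + i, (jb - 1) * nb + j)
               end do
            end do
         end do
      end do

      print *, 'block (2,1), entry (1,1):', ab(1, 1, 2, 1)  ! equals a(5,1)
    end program sb_demo

Here the last two subscripts (ib, jb) address a block and the first two address an element within it. Because Fortran stores arrays in column-major order, each NB x NB tile occupies NB*NB consecutive memory locations, so a kernel routine can stream through a block with stride-1 access instead of the stride-N access a standard two-dimensional array forces on it.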