Exploiting functional parallelism of POWER2 to design high-performance numerical algorithms
IBM Journal of Research and Development
Recursion leads to automatic variable blocking for dense linear-algebra algorithms
IBM Journal of Research and Development
Recursive Blocked Data Formats and BLAS's for Dense Linear Algebra Algorithms
PARA '98 Proceedings of the 4th International Workshop on Applied Parallel Computing, Large Scale Scientific and Industrial Problems
Applying recursion to serial and parallel QR factorization leads to better performance
IBM Journal of Research and Development
Minimal-storage high-performance Cholesky factorization via blocking and recursion
IBM Journal of Research and Development
Recursive Array Layouts and Fast Matrix Multiplication
IEEE Transactions on Parallel and Distributed Systems
PPAM'07 Proceedings of the 7th international conference on Parallel processing and applied mathematics
Parallel tiled QR factorization for multicore architectures
PPAM'07 Proceedings of the 7th international conference on Parallel processing and applied mathematics
Evaluating linear recursive filters using novel data formats for dense matrices
PPAM'07 Proceedings of the 7th international conference on Parallel processing and applied mathematics
A parallel non-square tiled algorithm for solving a kind of BVP for second-order ODEs
PPAM'09 Proceedings of the 8th international conference on Parallel processing and applied mathematics: Part I
Journal of Computational and Applied Mathematics
Minimizing associativity conflicts in Morton layout
PPAM'05 Proceedings of the 6th international conference on Parallel Processing and Applied Mathematics
Compiler-optimized kernels: an efficient alternative to hand-coded inner kernels
ICCSA'06 Proceedings of the 2006 international conference on Computational Science and Its Applications - Volume Part V
We describe new data structures for full and packed storage of dense symmetric/triangular arrays that generalize both. Using the new data structures, one is led to several new algorithms that save "half" the storage for symmetric matrices and outperform the current block-based Level 3 algorithms in LAPACK. We concentrate on the simplest forms of the new algorithms and show that they are a direct generalization of LINPACK. This means that Level 3 BLAS are not required to obtain Level 3 performance. The replacements for the Level 3 BLAS are so-called kernel routines (see [1]); on IBM platforms they can be produced from simple textbook-type codes by the XLF Fortran compiler. In the sequel we label these "vanilla" codes. On Power3, with a peak performance of 800 MFlops, Cholesky factorization runs at over 720 MFlops for order n ≥ 200 and reaches 735 MFlops at n = 400. Conventional full-format LAPACK DPOTRF with ESSL BLAS first gets to 600 MFlops at n ≥ 600 and only reaches a peak of 620 MFlops. For this result we used simple square blocked full matrix data formats in which the blocks themselves are stored in column-major (Fortran) or row-major (C) order. The simple algorithm for LU factorization with partial pivoting in this new data format is a direct generalization of the LINPACK algorithm DGEFA. Again, no conventional Level 3 BLAS are required; the replacements are again kernel routines. Programming for the square blocked full matrix format can be accomplished in standard Fortran through the use of three- and four-dimensional arrays; thus, no new compiler support is necessary. We also mention that other, more complicated algorithms are possible, e.g., recursive ones. The recursive algorithms are also easily programmed via tables that address where the blocks are stored in the two-dimensional recursive block array. Finally, we describe block hybrid formats. Doing so allows one to use no additional storage over conventional (full and packed) matrix storage. This means the new algorithms are completely portable.
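To make the square blocked idea concrete, here is a minimal Python sketch of a Cholesky factorization in which every operation is a plain triple-loop kernel acting on one contiguous, column-major nb x nb block. It is an illustration under stated assumptions, not the paper's code: the function names (`to_square_blocks`, `potrf_kernel`, `trsm_kernel`, `update_kernel`, `block_cholesky`, `from_square_blocks`) are hypothetical, a Python dictionary keyed by block coordinates stands in for the multi-dimensional Fortran arrays or block address tables mentioned in the abstract, and the block size is assumed to divide the matrix order. It does not implement the block hybrid formats.

```python
import math

def to_square_blocks(a, nb):
    """Copy the lower-triangular blocks of a symmetric matrix (list of rows)
    into contiguous nb*nb blocks, each stored in column-major order."""
    n = len(a)
    assert n % nb == 0, "sketch assumes nb divides n"
    m = n // nb
    blocks = {}
    for bi in range(m):
        for bj in range(bi + 1):
            blocks[(bi, bj)] = [a[bi * nb + i][bj * nb + j]
                                for j in range(nb) for i in range(nb)]
    return blocks, m

def potrf_kernel(d, nb):
    """In-place Cholesky of one diagonal block (lower triangle only)."""
    for j in range(nb):
        s = d[j + j * nb] - sum(d[j + p * nb] ** 2 for p in range(j))
        d[j + j * nb] = math.sqrt(s)
        for i in range(j + 1, nb):
            t = d[i + j * nb] - sum(d[i + p * nb] * d[j + p * nb] for p in range(j))
            d[i + j * nb] = t / d[j + j * nb]

def trsm_kernel(l, b, nb):
    """Overwrite b with the solution X of X * L^T = B, L lower triangular."""
    for j in range(nb):
        for p in range(j):
            ljp = l[j + p * nb]
            for i in range(nb):
                b[i + j * nb] -= b[i + p * nb] * ljp
        ljj = l[j + j * nb]
        for i in range(nb):
            b[i + j * nb] /= ljj

def update_kernel(c, a, b, nb):
    """c := c - a * b^T on three nb x nb column-major blocks."""
    for j in range(nb):
        for p in range(nb):
            bjp = b[j + p * nb]
            for i in range(nb):
                c[i + j * nb] -= a[i + p * nb] * bjp

def block_cholesky(blocks, m, nb):
    """Left-looking blocked Cholesky; each step is one kernel call per block."""
    for k in range(m):
        d = blocks[(k, k)]
        for j in range(k):
            update_kernel(d, blocks[(k, j)], blocks[(k, j)], nb)
        potrf_kernel(d, nb)
        for i in range(k + 1, m):
            c = blocks[(i, k)]
            for j in range(k):
                update_kernel(c, blocks[(i, j)], blocks[(k, j)], nb)
            trsm_kernel(d, c, nb)

def from_square_blocks(blocks, m, nb):
    """Rebuild the dense lower-triangular factor L as a list of rows."""
    n = m * nb
    l = [[0.0] * n for _ in range(n)]
    for (bi, bj), blk in blocks.items():
        for j in range(nb):
            for i in range(nb):
                r, c = bi * nb + i, bj * nb + j
                if r >= c:
                    l[r][c] = blk[i + j * nb]
    return l
```

In this sketch only the m(m+1)/2 blocks on or below the block diagonal are stored, and no general-purpose Level 3 BLAS call appears: the three small kernels play the role that the abstract assigns to kernel routines. A typical call sequence is `blocks, m = to_square_blocks(a, nb)`, then `block_cholesky(blocks, m, nb)`, then `from_square_blocks(blocks, m, nb)` to inspect the factor.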