Using non-canonical array layouts in dense matrix operations

Authors: José R. Herrero; Juan J. Navarro
Affiliation: Computer Architecture Dept., Univ. Politècnica de Catalunya, Barcelona, Spain (both authors)
Venue: PARA'06 Proceedings of the 8th international conference on Applied parallel computing: state of the art in scientific computing
Year: 2006

Abstract

We present two implementations of dense matrix multiplication based on two different non-canonical array layouts: one based on a hypermatrix data structure (HM) where data submatrices are stored using a recursive layout; the other based on a simple block data layout with square blocks (SB) where blocks are arranged in column-major order. We show that the iterative code using SB outperforms a recursive code using HM and obtains competitive results on a variety of platforms.
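
As an illustration of the square-block (SB) layout described in the abstract, the following is a minimal sketch and not the authors' code. The abstract states only that blocks are square and arranged in column-major order; the column-major ordering of elements inside each block, the requirement that matrix dimensions be multiples of the block size, and the names sb_index, to_sb and sb_matmul are assumptions made here for illustration.

```python
def sb_index(i, j, n_rows, b):
    """Offset of element (i, j) in a square-block (SB) layout.

    Assumptions (not taken from the paper): n_rows is a multiple of b,
    and elements inside each b x b block are stored column-major.
    """
    blocks_per_col = n_rows // b          # number of block rows
    bi, bj = i // b, j // b               # block coordinates
    ii, jj = i % b, j % b                 # position inside the block
    block_id = bj * blocks_per_col + bi   # blocks arranged column-major
    return block_id * b * b + jj * b + ii


def to_sb(a, b):
    """Copy a matrix given as a list of rows into a flat SB-layout list."""
    n_rows, n_cols = len(a), len(a[0])
    out = [0.0] * (n_rows * n_cols)
    for i in range(n_rows):
        for j in range(n_cols):
            out[sb_index(i, j, n_rows, b)] = a[i][j]
    return out


def sb_matmul(a_sb, b_sb, m, k, n, b):
    """C = A * B for A (m x k) and B (k x n), all stored in SB layout.

    Plain iterative loops over blocks; each block is contiguous in memory.
    """
    c_sb = [0.0] * (m * n)
    mb, kb, nb = m // b, k // b, n // b   # block counts per dimension
    bb = b * b                            # elements per block
    for bj in range(nb):
        for bk in range(kb):
            for bi in range(mb):
                a0 = (bk * mb + bi) * bb  # start of block A[bi, bk]
                b0 = (bj * kb + bk) * bb  # start of block B[bk, bj]
                c0 = (bj * mb + bi) * bb  # start of block C[bi, bj]
                # Inner kernel: b x b block product, column-major in-block
                for jj in range(b):
                    for kk in range(b):
                        bval = b_sb[b0 + jj * b + kk]
                        for ii in range(b):
                            c_sb[c0 + jj * b + ii] += a_sb[a0 + kk * b + ii] * bval
    return c_sb
```

In this sketch each inner-kernel call touches three contiguous runs of b*b elements, which is the property a block data layout is meant to provide. A tuned implementation would replace the three innermost loops with an optimized kernel and choose b to fit the target cache.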