Parallel and Cache-Efficient In-Place Matrix Storage Format Conversion

Authors:
Fred Gustavson;Lars Karlsson;Bo Kågström
Affiliations:
IBM T.J. Watson Research Center, Emeritus, and Umeå University;Umeå University;Umeå University
Venue:
ACM Transactions on Mathematical Software (TOMS)
Year:
2012

Abstract

Techniques and algorithms for efficient in-place conversion to and from standard and blocked matrix storage formats are described. Such functionality is required by numerical libraries that use different data layouts internally. Parallel algorithms and a software package for in-place matrix storage format conversion based on in-place matrix transposition are presented and evaluated. A new algorithm for in-place transposition which efficiently determines the structure of the transposition permutation a priori is one of the key ingredients. It enables effective load balancing in a parallel environment.
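To make the idea of in-place transposition concrete, here is a minimal sketch of the classic cycle-following approach. It is not the new algorithm from the paper, which determines the cycle structure a priori and runs in parallel. The function name `transpose_in_place` and the row-major flat-list layout are assumptions made for illustration only. The sketch relies on the standard fact that, for an m-by-n row-major matrix with N = m*n elements, the element at position k moves to position (k*m) mod (N-1), while positions 0 and N-1 stay fixed.

```python
def transpose_in_place(a, m, n):
    """Transpose an m-by-n row-major matrix stored in the flat list `a`, in place.

    Illustrative sequential sketch (hypothetical helper, not the paper's method).
    After the call, `a` holds the n-by-m transpose in row-major order.
    """
    size = m * n
    if len(a) != size:
        raise ValueError("len(a) must equal m * n")
    if size < 2:
        return
    last = size - 1  # positions 0 and `last` are fixed points of the permutation

    def dest(k):
        # Position k in the m-by-n layout maps here in the n-by-m layout.
        return (k * m) % last

    for start in range(1, last):
        # Handle each cycle once, starting from its smallest index (the leader).
        k = dest(start)
        while k > start:
            k = dest(k)
        if k < start:
            continue  # this cycle was already rotated from a smaller leader

        # Rotate the cycle: carry each value forward to its destination.
        carried = a[start]
        k = dest(start)
        while k != start:
            a[k], carried = carried, a[k]
            k = dest(k)
        a[start] = carried
```

For example, calling `transpose_in_place(a, 2, 3)` on `a = [1, 2, 3, 4, 5, 6]` leaves `a` as `[1, 4, 2, 5, 3, 6]`. The leader test rescans cycles and so trades extra time for zero extra storage. Knowing the cycle structure in advance, as the abstract describes, is what would let the cycles be divided evenly among threads.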