The cache performance and optimizations of blocked algorithms
ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
Efficient transposition algorithms for large matrices
Proceedings of the 1993 ACM/IEEE conference on Supercomputing
SIAM Journal on Computing
Transporting a matrix on a vector computer
Parallel Computing
A Method for Transposing a Matrix
Journal of the ACM (JACM)
Array Permutation by Index-Digit Permutation
Journal of the ACM (JACM)
Algorithm 467: Matrix Transposition in Place
Communications of the ACM
Algorithm 513: Analysis of In-Situ Transposition [F1]
ACM Transactions on Mathematical Software (TOMS)
Algorithm 380: in-situ transposition of a rectangular matrix [F1]
Communications of the ACM
Algorithm 302: Transpose vector stored array
Communications of the ACM
Recursive Blocked Data Formats and BLAS's for Dense Linear Algebra Algorithms
PARA '98 Proceedings of the 4th International Workshop on Applied Parallel Computing, Large Scale Scientific and Industrial Problems
New Generalized Matrix Data Structures Lead to a Variety of High-Performance Algorithms
Proceedings of the IFIP TC2/WG2.5 Working Conference on the Architecture of Scientific Software
Tiling, Block Data Layout, and Memory Hierarchy Performance
IEEE Transactions on Parallel and Distributed Systems
A Computer Algorithm for Transposing Nonsquare Matrices
IEEE Transactions on Computers
Parallel matrix multiplication based on space-filling curves on shared memory multicore platforms
Proceedings of the 2008 workshop on Memory access on future processors: a solved problem?
Distributed SBP Cholesky factorization algorithms with near-optimal scheduling
ACM Transactions on Mathematical Software (TOMS)
QR factorization for the Cell Broadband Engine
Scientific Programming - High Performance Computing with the Cell Broadband Engine
In-place transposition of rectangular matrices
PARA'06 Proceedings of the 8th international conference on Applied parallel computing: state of the art in scientific computing
PARA'10 Proceedings of the 10th international conference on Applied Parallel and Scientific Computing - Volume Part I
New level-3 BLAS kernels for cholesky factorization
PPAM'11 Proceedings of the 9th international conference on Parallel Processing and Applied Mathematics - Volume Part I
Cache blocking for linear algebra algorithms
PPAM'11 Proceedings of the 9th international conference on Parallel Processing and Applied Mathematics - Volume Part I
Level-3 Cholesky Factorization Routines Improve Performance of Many Cholesky Algorithms
ACM Transactions on Mathematical Software (TOMS)
Scaling LAPACK panel operations using parallel cache assignment
ACM Transactions on Mathematical Software (TOMS)
An improved parallel singular value algorithm and its implementation for multicore hardware
SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
A decomposition for in-place matrix transposition
Proceedings of the 19th ACM SIGPLAN symposium on Principles and practice of parallel programming
In-place transposition of rectangular matrices on accelerators
Proceedings of the 19th ACM SIGPLAN symposium on Principles and practice of parallel programming
Hi-index | 0.00 |
Techniques and algorithms for efficient in-place conversion to and from standard and blocked matrix storage formats are described. Such functionality is required by numerical libraries that use different data layouts internally. Parallel algorithms and a software package for in-place matrix storage format conversion based on in-place matrix transposition are presented and evaluated. A new algorithm for in-place transposition which efficiently determines the structure of the transposition permutation a priori is one of the key ingredients. It enables effective load balancing in a parallel environment.