An Efficient Algorithm for Out-of-Core Matrix Transposition

Authors:
J. Suh; V. K. Prasanna
Venue:
IEEE Transactions on Computers
Year:
2002

Abstract

Efficient transposition of out-of-core matrices has been widely studied. These efforts have focused on reducing the number of I/O operations. However, in state-of-the-art architectures, memory-to-memory data transfer time and index computation time are also significant components of the overall time. In this paper, we propose an algorithm that considers both the index computation time and the I/O time in order to reduce the overall execution time. Our algorithm does so by reducing the number of I/O operations and eliminating the index computation. Two techniques are employed: writing the data onto disk in predefined patterns and balancing the number of disk read and write operations. Index computation, an expensive operation involving two divisions and a multiplication, is eliminated by partitioning the memory into read and write buffers. The expensive in-processor permutation is replaced by data collection from the read buffer to the write buffer. Even though this partitioning may increase the number of I/O operations in some cases, it results in an overall reduction in the execution time due to the elimination of the expensive index computation. Our algorithm is analyzed using the well-known Linear Model and the Parallel Disk Model. The experimental results on Sun Enterprise, SGI R12000, and Pentium III show that our algorithm reduces the overall execution time by up to 50 percent compared with the best known algorithms in the literature.
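
The contrast the abstract draws between per-element index computation and collection from a read buffer into a write buffer can be illustrated with a minimal sketch. This is not the authors' algorithm: it assumes a row-major matrix held in a flat Python list standing in for the disk, and the names `transposed_index`, `transpose_by_index`, `transpose_by_collection`, and `tile_rows` are hypothetical. The first routine pays two divisions and a multiplication per element; the second gathers each column of a buffered tile with a fixed stride and emits it as one contiguous run.

```python
def transposed_index(i, rows, cols):
    """Destination of element i of a row-major rows x cols matrix."""
    r = i // cols        # division 1: source row
    c = i % cols         # division 2 (remainder): source column
    return c * rows + r  # multiplication: slot in the cols x rows result


def transpose_by_index(src, rows, cols):
    """Baseline: compute a destination index for every single element."""
    dst = [None] * (rows * cols)
    for i, value in enumerate(src):
        dst[transposed_index(i, rows, cols)] = value
    return dst


def transpose_by_collection(src, rows, cols, tile_rows):
    """Illustrative buffered variant (an assumption, not the paper's method).

    A tile of up to tile_rows rows is loaded into a read buffer; each
    column of the tile is then collected into a write buffer with a
    fixed stride and stored as one contiguous run of the output.
    No per-element division is performed.
    """
    dst = [None] * (rows * cols)
    for r0 in range(0, rows, tile_rows):
        height = min(tile_rows, rows - r0)
        # One sequential "read" of height full rows.
        read_buf = src[r0 * cols:(r0 + height) * cols]
        for c in range(cols):
            # Strided gather: column c of the tile, height elements.
            write_buf = read_buf[c::cols]
            # One contiguous "write" into row c of the transposed matrix.
            start = c * rows + r0
            dst[start:start + height] = write_buf
    return dst


if __name__ == "__main__":
    matrix = list(range(12))  # 3 x 4 matrix, row-major
    print(transpose_by_index(matrix, 3, 4))
    print(transpose_by_collection(matrix, 3, 4, tile_rows=2))
```

In this sketch the tile height plays the role of the read-buffer size: a larger tile yields longer contiguous writes, at the cost of more memory reserved for buffering.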