Efficient transposition of out-of-core matrices has been widely studied. These efforts have focused on reducing the number of I/O operations. However, in state-of-the-art architectures, memory-to-memory data transfer time and index computation time are also significant components of the overall time. In this paper, we propose an algorithm that accounts for both the index computation time and the I/O time. It lowers the total execution time by reducing the number of I/O operations and eliminating the index computation. Two techniques are employed: writing the data onto disk in predefined patterns and balancing the number of disk read and write operations. Index computation, an expensive operation involving two divisions and a multiplication, is eliminated by partitioning the memory into read and write buffers. The expensive in-processor permutation is replaced by data collection from the read buffer to the write buffer. Even though this partitioning may increase the number of I/O operations in some cases, it reduces the overall execution time because the expensive index computation is eliminated. Our algorithm is analyzed using the well-known Linear Model and the Parallel Disk Model. Experimental results on Sun Enterprise, SGI R12000, and Pentium III show that our algorithm reduces the overall execution time by up to 50 percent compared with the best known algorithms in the literature.
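To make the cost of index computation concrete, the following is a minimal in-memory sketch in Python. It is not the algorithm proposed in the paper. It only contrasts a transposition that computes each element's destination index (two divisions and one multiplication per element, assuming a row-major layout) with one that collects elements from a read buffer into a separate write buffer using strided passes. The function names and the row-major assumption are illustrative choices, not taken from the paper.

```python
def transpose_with_index_computation(src, rows, cols):
    """Place each element by computing its destination index explicitly."""
    dst = [None] * (rows * cols)
    for i in range(rows * cols):
        r = i // cols               # first division: source row
        c = i % cols                # second division (remainder): source column
        dst[c * rows + r] = src[i]  # multiplication: destination offset
    return dst


def transpose_by_collection(read_buf, rows, cols):
    """Collect elements from a read buffer into a separate write buffer.

    Column c of the row-major input is every cols-th element starting at
    offset c, so a strided pass gathers it without any per-element
    division or multiplication.
    """
    write_buf = []
    for c in range(cols):
        write_buf.extend(read_buf[c::cols])
    return write_buf


# A 2 x 3 matrix stored row-major: [[0, 1, 2], [3, 4, 5]]
matrix = [0, 1, 2, 3, 4, 5]
by_index = transpose_with_index_computation(matrix, 2, 3)
by_collection = transpose_by_collection(matrix, 2, 3)
```

Both functions return the same 3 x 2 result. The second one uses twice the buffer space, which mirrors the trade-off described above: splitting memory into read and write buffers leaves less room per buffer but removes the per-element arithmetic.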