The input/output complexity of sorting and related problems
Communications of the ACM
Introduction to parallel computing: design and analysis of algorithms
Efficient transposition algorithms for large matrices
Proceedings of the 1993 ACM/IEEE conference on Supercomputing
RAID: high-performance, reliable secondary storage
ACM Computing Surveys (CSUR)
Optimal read-once parallel disk scheduling
Proceedings of the sixth workshop on I/O in parallel and distributed systems
Multidimensional Digital Signal Processing
Hierarchical tiling for improved superscalar performance
IPPS '95 Proceedings of the 9th International Symposium on Parallel Processing
Portable Implementation of Real-Time Signal Processing Benchmarks on HPC Platforms
PARA '98 Proceedings of the 4th International Workshop on Applied Parallel Computing, Large Scale Scientific and Industrial Problems
An Improved Parallel Disk Scheduling Algorithm
HIPC '98 Proceedings of the Fifth International Conference on High Performance Computing
Data Management for Large-Scale Scientific Computations in High Performance Distributed Systems
HPDC '99 Proceedings of the 8th IEEE International Symposium on High Performance Distributed Computing
Design, Implementation and Evaluation of Parallel Pipelined STAP on Parallel Computers
IPPS '98 Proceedings of the 12th International Parallel Processing Symposium
On Transposing Large 2^n x 2^n Matrices
IEEE Transactions on Computers
A Fast Computer Method for Matrix Transposing
IEEE Transactions on Computers
A Method for Transposing Externally Stored Matrices
IEEE Transactions on Computers
A Generalization of Eklundh's Algorithm for Transposing Large Matrices
IEEE Transactions on Computers
Efficient transposition of large-scale matrices has been widely studied. These efforts have focused on reducing the number of I/O operations. However, on state-of-the-art architectures, data transfer time and index computation time are also significant components of the overall time. In this paper, we propose an algorithm that considers all of these costs and reduces the overall execution time. The reduction is achieved by using two techniques: (1) writing the data onto disk in predefined patterns and (2) balancing the number of disk read and write operations. Even though our approach may increase the number of I/O operations in some cases, it results in an overall reduction in the execution time. The index computation, an expensive operation involving two divisions and a multiplication, is eliminated by partitioning the memory into two buffers. The expensive in-processor permutation is replaced by data collection operations. Our algorithm is analyzed using the well-known Linear Model and the Parallel Disk Model. The experimental results on a Sun Enterprise and a DEC Alpha show that our algorithm reduces the execution time by about 50%, compared with the best known algorithms in the literature.
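The cost the abstract attributes to index computation, and the blocked two-buffer idea that avoids it, can be illustrated with a minimal in-memory sketch. This is not the authors' algorithm (which operates on disk-resident data); the function names, the tile size b, and the flat-list layout are assumptions for illustration only. The naive version pays a division, a modulo, and a multiplication per element; the blocked version gathers a tile into a collection buffer and writes it out with stride arithmetic at block granularity.

```python
# Illustrative sketch only, not the paper's out-of-core algorithm.
# Matrices are n x m, stored row-major in a flat Python list.

def transpose_naive(src, n, m):
    """Per-element index computation: for each linear index i, finding the
    destination costs two integer divisions (// and %) and a multiplication."""
    dst = [0] * (n * m)
    for i in range(n * m):
        row, col = i // m, i % m      # two division-type operations
        dst[col * n + row] = src[i]   # one multiplication
    return dst

def transpose_two_buffer(src, n, m, b):
    """Blocked transpose with a small collection buffer (the tile).
    Index arithmetic happens once per b x b tile, not once per element;
    within a tile, destinations advance by fixed strides.
    Assumes b divides both n and m."""
    dst = [0] * (n * m)
    for bi in range(0, n, b):
        for bj in range(0, m, b):
            # Data collection: copy tile rows contiguously into a buffer.
            tile = [src[(bi + r) * m + bj : (bi + r) * m + bj + b]
                    for r in range(b)]
            # Write the transposed tile using stride arithmetic only.
            base = bj * n + bi
            for c in range(b):
                for r in range(b):
                    dst[base + c * n + r] = tile[r][c]
    return dst
```

Both routines produce the same result; the blocked version amortizes the index arithmetic over b*b elements, which is the spirit of replacing per-element permutation with data collection operations.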