Imaging the earth's interior
FFTs in external or hierarchical memory
The Journal of Supercomputing
CCL: A Portable and Tunable Collective Communication Library for Scalable Parallel Computers
IEEE Transactions on Parallel and Distributed Systems
Using MPI: portable parallel programming with the message-passing interface
Matrix transpose for block allocations on torus and de Bruijn networks
Journal of Parallel and Distributed Computing
Efficient Algorithms for All-to-All Communications in Multiport Message-Passing Systems
IEEE Transactions on Parallel and Distributed Systems
Optimization of MPI collectives on clusters of large-scale SMP's
SC '99 Proceedings of the 1999 ACM/IEEE conference on Supercomputing
All-to-All Personalized Communication in Multidimensional Torus and Mesh Networks
IEEE Transactions on Parallel and Distributed Systems
A comparison of optimal FFTs on torus and hypercube multicomputers
Parallel Computing
Spotlight-Mode Synthetic Aperture Radar: A Signal Processing Approach
The Scalability of FFT on Parallel Computers
IEEE Transactions on Parallel and Distributed Systems
CANPC '98 Proceedings of the Second International Workshop on Network-Based Parallel Computing: Communication, Architecture, and Applications
Improved MPI All-to-all Communication on a Giganet SMP Cluster
Proceedings of the 9th European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface
Message passing and shared address space parallelism on an SMP cluster
Parallel Computing
The Performance Advantages of Integrating Message Passing in Cache-Coherent Multiprocessors
An efficient parallel-processing method for transposing large matrices in place
IEEE Transactions on Image Processing
Matrix transposition on parallel systems typically requires costly all-to-all communication. In this paper, we provide a comparative characterization of efficient algorithms for transposing small and large matrices on the popular symmetric multiprocessor (SMP) architecture, whose large aggregate bandwidth and low-latency inter-process communication keep communication costs relatively low. We analyze the data send/receive costs and the memory requirements of these matrix-transpose algorithms. We then propose an adaptive algorithm that minimizes the overhead of the transpose operation given parameters such as the data size, the number of processors, the message start-up time, and the effective communication bandwidth.
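The adaptive selection described above can be sketched with a simple latency-bandwidth cost model. The functions and constants below are illustrative assumptions, not the paper's actual formulas: they compare a direct pairwise all-to-all (each process exchanges one block with every other process) against a Bruck-style combined all-to-all (log2(p) rounds, each moving half the local data), and pick whichever is cheaper for the given machine parameters.

```python
import math

BYTES_PER_ELEM = 8  # assume double-precision matrix elements


def direct_cost(n, p, t_s, bw):
    """Direct all-to-all: p-1 exchanges, each carrying an n/p^2-element block.

    n  = total matrix elements, p = number of processes,
    t_s = per-message start-up time (s), bw = bandwidth (bytes/s).
    """
    msg_bytes = BYTES_PER_ELEM * n / (p * p)
    return (p - 1) * (t_s + msg_bytes / bw)


def combined_cost(n, p, t_s, bw):
    """Bruck-style all-to-all: ceil(log2 p) rounds, each moving half the
    local n/p elements, trading extra data volume for fewer messages."""
    msg_bytes = BYTES_PER_ELEM * n / (2 * p)
    return math.ceil(math.log2(p)) * (t_s + msg_bytes / bw)


def choose_transpose(n, p, t_s, bw):
    """Adaptive choice: return the name of the cheaper strategy."""
    if direct_cost(n, p, t_s, bw) <= combined_cost(n, p, t_s, bw):
        return "direct"
    return "combined"
```

Under this model, small matrices on many processors favor the combined algorithm (start-up time dominates, so fewer messages win), while large matrices favor the direct algorithm (bandwidth dominates, so moving each element only once wins); this is the kind of crossover an adaptive scheme exploits.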