Efficient algorithms for block-cyclic array redistribution between processor sets

Authors:
Neungsoo Park;Viktor K. Prasanna;Cauligi Raghavendra
Affiliations:
University of Southern California, Los Angeles, CA;University of Southern California, Los Angeles, CA;The Aerospace Corporation, Los Angeles, CA
Venue:
SC '98 Proceedings of the 1998 ACM/IEEE conference on Supercomputing
Year:
1998

Abstract

Run-time array redistribution is necessary to enhance the performance of parallel programs on distributed memory supercomputers. In this paper, we present an efficient algorithm for array redistribution from cyclic(x) on P processors to cyclic(Kx) on Q processors. The algorithm reduces the overall communication time by considering the data transfer, communication schedule, and index computation costs. The proposed algorithm is based on a generalized circulant matrix formalism. It generates a schedule that minimizes the number of communication steps and eliminates node contention in each communication step. The network bandwidth is fully utilized by ensuring that equal-sized messages are transferred in each communication step. Furthermore, the procedure to compute the schedule and the index sets is extremely fast, taking O(max(P, Q)) time. Our proposed algorithm is therefore suitable for run-time array redistribution. To evaluate the performance of our scheme, we implemented the algorithm using C and MPI. The experiments were conducted on the IBM SP2. The experimental results show that the proposed algorithm outperforms well-known algorithms when the total communication time, including the data transfer, schedule computation, and index computation times, is considered.
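
To make the redistribution problem concrete, the following is a minimal Python sketch of the block-cyclic ownership rule and of the resulting source-to-destination traffic. It is not the circulant-matrix scheduling algorithm described in the abstract; it only enumerates, by brute force, how many elements each source processor must send to each destination processor. The function names `owner` and `send_counts` and the sizes in the usage example are assumptions chosen for illustration.

```python
def owner(i, block, nprocs):
    """Processor that owns global index i under cyclic(block) on nprocs processors."""
    return (i // block) % nprocs


def send_counts(n, x, P, K, Q):
    """Brute-force traffic matrix for cyclic(x) on P -> cyclic(K*x) on Q.

    counts[s][d] is the number of elements that source processor s
    must send to destination processor d for an array of n elements.
    """
    counts = [[0] * Q for _ in range(P)]
    for i in range(n):
        src = owner(i, x, P)        # owner in the source distribution
        dst = owner(i, K * x, Q)    # owner in the target distribution
        counts[src][dst] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical sizes: 48 elements, cyclic(2) on 3 processors
    # redistributed to cyclic(4) on 4 processors (K = 2).
    for row in send_counts(48, 2, 3, 2, 4):
        print(row)
```

This enumeration takes time proportional to the array size, so it is useful only for inspecting the communication pattern on small cases; the abstract's point is that the schedule and index sets can instead be computed in O(max(P, Q)) time.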