Expressing Boolean cube matrix algorithms in shared memory primitives

Authors:
S. L. Johnsson; C.-T. Ho
Affiliations:
Department of Computer Science, Yale University, New Haven, CT; Department of Computer Science, Yale University, New Haven, CT
Venue:
C3P: Proceedings of the Third Conference on Hypercube Concurrent Computers and Applications, Volume 2
Year:
1989

References

Communication efficient basic linear algebra computations on hypercube architectures. Journal of Parallel and Distributed Computing.

Optimal algorithms for stable dimension permutations on Boolean cubes. C3P: Proceedings of the Third Conference on Hypercube Concurrent Computers and Applications: Architecture, Software, Computer Systems, and General Issues, Volume 1.

Optimum Broadcasting and Personalized Communication in Hypercubes. IEEE Transactions on Computers.

A cellular computer to implement the Kalman filter algorithm.

Combinatorial Algorithms: Theory and Practice.


Abstract

The multiplication of (large) matrices allocated evenly on Boolean cube configured multiprocessors poses several interesting trade-offs among communication time, processor utilization, and storage requirement. In [7] we investigated several algorithms for different degrees of parallelization, and showed how the best-performing algorithm depends on the matrix shapes and the multiprocessor parameters, and how processors should optimally be allocated to the different loops.

In this paper the focus is on expressing these algorithms in shared memory type primitives. We assume that all processors share the same global address space, and present communication primitives both for nearest-neighbor communication and for global operations such as broadcasting from one processor to a set of processors, the reverse operation of plus-reduction, and matrix transposition (dimension permutation). We consider both the case in which each processor can communicate on only one port at a time and the case of concurrent communication on all processor ports. The communication algorithms are provably optimal within a factor of two. We describe both constant storage algorithms and algorithms with reduced communication time but a storage requirement proportional to the number of processors and the matrix sizes (for a one-dimensional partitioning of the matrices).
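The broadcast and plus-reduction primitives named in the abstract can be illustrated with a minimal Python sketch. It assumes the one-port model, simulates the 2**d processors of a d-dimensional Boolean cube as a list indexed by processor address (standing in for the shared global address space), and uses a standard spanning-binomial-tree schedule. The function names `cube_broadcast` and `cube_plus_reduce` are hypothetical, this is not necessarily the algorithm of the paper, and it broadcasts to the whole cube rather than to an arbitrary set of processors.

```python
"""Sketch: one-port broadcast and plus-reduction on a d-dimensional Boolean cube.

Assumption (hypothetical setup): the 2**d processors are simulated by a Python
list indexed by processor address, standing in for a shared global address space.
"""
from typing import List, Tuple


def cube_broadcast(values: List[float], d: int) -> Tuple[List[float], int]:
    """Broadcast values[0] to all 2**d processors over a spanning binomial tree.

    In step j, every processor that already holds the datum (addresses < 2**j)
    sends it to its neighbor across cube dimension j. Each processor uses at
    most one port per step, so the broadcast takes d steps.
    """
    n = 1 << d
    assert len(values) == n
    out = list(values)
    steps = 0
    for j in range(d):
        for src in range(1 << j):            # holders after j steps
            out[src ^ (1 << j)] = out[src]   # neighbor across dimension j
        steps += 1
    return out, steps


def cube_plus_reduce(values: List[float], d: int) -> Tuple[float, int]:
    """Sum all 2**d values into processor 0 by running the broadcast in reverse.

    In step j (j = d-1 down to 0), processors 2**j .. 2**(j+1)-1 send their
    partial sums across dimension j to processors 0 .. 2**j-1, which add them.
    """
    n = 1 << d
    assert len(values) == n
    acc = list(values)
    steps = 0
    for j in reversed(range(d)):
        for src in range(1 << j, 1 << (j + 1)):
            acc[src ^ (1 << j)] += acc[src]  # partner has bit j cleared
        steps += 1
    return acc[0], steps


if __name__ == "__main__":
    d = 3  # 8 processors
    data, b_steps = cube_broadcast([7.0] + [0.0] * 7, d)
    total, r_steps = cube_plus_reduce([float(i) for i in range(8)], d)
    print(data, b_steps)   # [7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0] 3
    print(total, r_steps)  # 28.0 3
```

Both operations take d steps with a single port active per processor per step. The concurrent all-port communication case considered in the paper admits faster schemes, which this sketch does not model.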