Mathematical and Computer Modelling: An International Journal
We develop and analyze novel algorithms that make efficient use of the communication system in distributed memory architectures with processing elements interconnected by a hypercube network. The algorithms studied here include the parallel Gauss-Jordan (GJ) matrix inversion algorithm and the Gaussian Elimination (GE) algorithm for LU factorization. We first propose a new broadcasting algorithm on the hypercube multiprocessor for the parallel GJ algorithm. This algorithm ensures that the data items are sent out from the source and arrive at the destinations at the earliest possible time. We then present a parallel GJ inversion algorithm using row partitioning. This algorithm exploits a compute-and-send-ahead strategy for achieving overlapping of communication and computation, and the resulting framework leads to rigorous analytical and model-based numerical performance analysis of our parallel algorithms. In particular, we prove a lower bound on the matrix size such that data transmission is fully overlapped by computation. We also prove that the message queue length in the input buffer of each processor is at most two. We next consider the GJ algorithm under submatrix partitioning, with or without pivoting. We show that when submatrix partitioning is used, even when communication is fully overlapped by computation, the communication overhead is larger than when using row partitioning. Thus, we show that by minimizing the communication overhead, the row partitioning scheme can indeed have better overall performance than the submatrix partitioning scheme. Finally, we extend the idea of overlapping communication and computation to the parallel LU factorization algorithm.
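To make the hypercube broadcasting setting concrete, the sketch below simulates the textbook binomial-tree broadcast on a d-dimensional hypercube, in which every node that already holds the data forwards it across one new dimension per step. This is not the new broadcasting algorithm proposed in the paper, whose details the abstract does not give; it is only the standard baseline. The function name `hypercube_broadcast_schedule` and its parameters are assumptions made for illustration.

```python
def hypercube_broadcast_schedule(dim, source=0):
    """Simulate a standard binomial-tree broadcast on a hypercube.

    Nodes are labelled 0 .. 2**dim - 1; two nodes are neighbours when
    their labels differ in exactly one bit. Returns a list of steps,
    each step being a list of (sender, receiver) pairs.
    NOTE: illustrative baseline only, not the paper's algorithm.
    """
    has_data = {source}          # nodes that currently hold the message
    steps = []
    for bit in range(dim):
        mask = 1 << bit
        # Every node holding the data sends to its neighbour across
        # dimension `bit`; those neighbours have not been reached yet.
        sends = [(node, node ^ mask) for node in sorted(has_data)]
        has_data.update(receiver for _, receiver in sends)
        steps.append(sends)
    return steps


if __name__ == "__main__":
    for step_no, sends in enumerate(hypercube_broadcast_schedule(3)):
        print(f"step {step_no}: {sends}")
```

In this baseline the number of informed nodes doubles at every step, so all 2^d processors are reached after d steps. A scheme like the one the abstract describes would additionally have to arrange sends so that data arrive at the earliest possible time while computation proceeds.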