Modelling and analysis of communication overhead for parallel matrix algorithms

  • Authors:
  • Xiaodong Wang; V. P. Roychowdhury

  • Affiliations:
  • Department of Electrical Engineering, Texas A&M University, College Station, TX 77843, U.S.A.; Department of Electrical Engineering, University of California, Los Angeles, CA 90095, U.S.A.

  • Venue:
  • Mathematical and Computer Modelling: An International Journal
  • Year:
  • 2000


Abstract

We develop and analyze novel algorithms that make efficient use of the communication system in distributed memory architectures with processing elements interconnected by a hypercube network. The algorithms studied here include the parallel Gauss-Jordan (GJ) matrix inversion algorithm and the Gaussian Elimination (GE) algorithm for LU factorization. We first propose a new broadcasting algorithm on the hypercube multiprocessor for the parallel GJ algorithm. This algorithm ensures that the data items are sent out from the source and arrive at the destinations at the earliest possible time. We then present a parallel GJ inversion algorithm using row partitioning. This algorithm exploits a compute-and-send-ahead strategy to overlap communication with computation, and the resulting framework leads to rigorous analytical and model-based numerical performance analysis of our parallel algorithms. In particular, we prove a lower bound on the matrix size such that data transmission is fully overlapped by computation. We also prove that the message queue length in the input buffer of each processor is at most two. We next consider the GJ algorithm under submatrix partitioning, with or without pivoting. We show that when submatrix partitioning is used, even when communication is fully overlapped by computation, the communication overhead is larger than when using row partitioning. Thus, we show that by minimizing the communication overhead, the row partitioning scheme can indeed have better overall performance than the submatrix partitioning scheme. Finally, we extend the idea of overlapping communication and computation to the parallel LU factorization algorithm.
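As background for the broadcasting discussion, the classic dimension-order hypercube broadcast can be sketched as follows. This is illustrative only (the function name and simulation are not from the paper); the authors' algorithm refines this pattern so data items leave the source and reach all destinations at the earliest possible time.

```python
# Sketch of the classic dimension-order broadcast on a d-dimensional
# hypercube: in step k, every processor that already holds the message
# forwards it to the neighbour whose id differs in bit k.
# Illustrative background only, not the paper's earliest-arrival scheme.

def hypercube_broadcast_times(d, source=0):
    """Return a dict mapping node id -> step at which the message arrives."""
    arrival = {source: 0}
    for k in range(d):                      # one step per hypercube dimension
        for node in list(arrival):
            neighbour = node ^ (1 << k)     # flip bit k to get the neighbour
            if neighbour not in arrival:
                arrival[neighbour] = k + 1
    return arrival

times = hypercube_broadcast_times(3)
# With d = 3, all 8 nodes are reached within 3 steps; node 7, which
# differs from the source in every bit, is among the last to receive.
```

Under this scheme a broadcast on 2^d processors completes in d steps, which is the baseline that any earliest-arrival refinement must match or beat per destination.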
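The row-partitioned GJ inversion with compute-and-send-ahead can be illustrated with a serial sketch, with comments marking where, in the parallel version, the owner of the pivot row would send it ahead to the other processors while continuing its own local updates. This is a minimal illustration under my own naming, not the authors' implementation, and it omits pivoting.

```python
# Serial Gauss-Jordan inversion on the augmented system [A | I].
# Comments mark where the row-partitioned parallel algorithm would
# broadcast (send ahead) the pivot row so that communication overlaps
# with the remaining local elimination work.  No pivoting is done here.

def gauss_jordan_inverse(a):
    n = len(a)
    # Augment with the identity: each row becomes [A_i | e_i].
    aug = [row[:] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(a)]
    for k in range(n):
        # Parallel version: the processor owning row k normalises it and
        # SENDS IT AHEAD to all other processors at this point, then keeps
        # eliminating its own local rows while the broadcast is in flight.
        p = aug[k][k]
        aug[k] = [x / p for x in aug[k]]
        for i in range(n):
            if i != k:
                f = aug[i][k]
                aug[i] = [x - f * y for x, y in zip(aug[i], aug[k])]
    # The right half of the augmented matrix now holds A^{-1}.
    return [row[n:] for row in aug]

inv = gauss_jordan_inverse([[2.0, 1.0], [1.0, 1.0]])
```

Because each processor holds whole rows, only the current pivot row crosses the network per step, which is the communication pattern whose overhead the abstract compares against submatrix partitioning.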