Modelling and analysis of communication overhead for parallel matrix algorithms

  • Authors:
  • Xiaodong Wang; V. P. Roychowdhury

  • Affiliations:
  • Department of Electrical Engineering, Texas A&M University, College Station, TX 77843, U.S.A.; Department of Electrical Engineering, University of California, Los Angeles, CA 90095, U.S.A.

  • Venue:
  • Mathematical and Computer Modelling: An International Journal
  • Year:
  • 2000


Abstract

We develop and analyze novel algorithms that make efficient use of the communication system in distributed memory architectures with processing elements interconnected by a hypercube network. The algorithms studied here include the parallel Gauss-Jordan (GJ) matrix inversion algorithm and the Gaussian Elimination (GE) algorithm for LU factorization. We first propose a new broadcasting algorithm on the hypercube multiprocessor for the parallel GJ algorithm. This algorithm ensures that the data items are sent out from the source and arrive at the destinations at the earliest possible time. We then present a parallel GJ inversion algorithm using row partitioning. This algorithm exploits a compute-and-send-ahead strategy to overlap communication with computation, and the resulting framework leads to rigorous analytical and model-based numerical performance analysis of our parallel algorithms. In particular, we prove a lower bound on the matrix size such that data transmission is fully overlapped by computation. We also prove that the message queue length in the input buffer of each processor is at most two. We next consider the GJ algorithm under submatrix partitioning, with or without pivoting. We show that when submatrix partitioning is used, even when communication is fully overlapped by computation, the communication overhead is larger than when using row partitioning. Thus, we show that by minimizing the communication overhead, the row partitioning scheme can indeed have better overall performance than the submatrix partitioning scheme. Finally, we extend the idea of overlapping communication and computation to the parallel LU factorization algorithm.
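As background for the broadcasting discussion, the classic dimension-order hypercube broadcast can be sketched as follows. This is illustrative only (the function name and simulation are not from the paper); the authors' algorithm refines this pattern so data items leave the source and reach all destinations at the earliest possible time.

```python
# Sketch of the classic dimension-order broadcast on a d-dimensional
# hypercube: in step k, every processor that already holds the message
# forwards it to the neighbour whose id differs in bit k.
# Illustrative background only, not the paper's earliest-arrival scheme.

def hypercube_broadcast_times(d, source=0):
    """Return a dict mapping node id -> step at which the message arrives."""
    arrival = {source: 0}
    for k in range(d):                      # one step per hypercube dimension
        for node in list(arrival):
            neighbour = node ^ (1 << k)     # flip bit k to get the neighbour
            if neighbour not in arrival:
                arrival[neighbour] = k + 1
    return arrival

times = hypercube_broadcast_times(3)
# With d = 3, all 8 nodes are reached within 3 steps; node 7, which
# differs from the source in every bit, is among the last to receive.
```

Under this scheme a broadcast on 2^d processors completes in d steps, which is the baseline that any earliest-arrival refinement must match or beat per destination.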
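The row-partitioned GJ inversion with compute-and-send-ahead can be illustrated with a serial sketch, with comments marking where, in the parallel version, the owner of the pivot row would send it ahead to the other processors while continuing its own local updates. This is a minimal illustration under my own naming, not the authors' implementation, and it omits pivoting.

```python
# Serial Gauss-Jordan inversion on the augmented system [A | I].
# Comments mark where the row-partitioned parallel algorithm would
# broadcast (send ahead) the pivot row so that communication overlaps
# with the remaining local elimination work.  No pivoting is done here.

def gauss_jordan_inverse(a):
    n = len(a)
    # Augment with the identity: each row becomes [A_i | e_i].
    aug = [row[:] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(a)]
    for k in range(n):
        # Parallel version: the processor owning row k normalises it and
        # SENDS IT AHEAD to all other processors at this point, then keeps
        # eliminating its own local rows while the broadcast is in flight.
        p = aug[k][k]
        aug[k] = [x / p for x in aug[k]]
        for i in range(n):
            if i != k:
                f = aug[i][k]
                aug[i] = [x - f * y for x, y in zip(aug[i], aug[k])]
    # The right half of the augmented matrix now holds A^{-1}.
    return [row[n:] for row in aug]

inv = gauss_jordan_inverse([[2.0, 1.0], [1.0, 1.0]])
```

Because each processor holds whole rows, only the current pivot row crosses the network per step, which is the communication pattern whose overhead the abstract compares against submatrix partitioning.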