Parallel algorithms for super performance

  • Authors:
  • D. J. Shakshober

  • Affiliations:
  • Digital Equipment Corporation, BXB2-2/G08, 60 Codman Hill Road, Boxboro, MA

  • Venue:
  • Proceedings of the 1989 ACM/IEEE conference on Supercomputing
  • Year:
  • 1989

Abstract

This paper describes the development of parallel algorithms on M31, a large-scale, shared-memory multiprocessor VAX computer. Matrix operations were optimized for a subset of the BLAS, the Basic Linear Algebra Subprograms. Efficient image processing algorithms were also developed for parallel convolution, correlation, and fast Fourier transforms (non-synchronizing one- and two-dimensional FFTs). The effect of matrix partitioning was examined using two different memory allocation strategies. We found that contiguous memory partitioning can yield performance gains beyond the linear expectation, that is, superlinear speedup. This super performance was achieved through a parallel algorithm devised to minimize cache replacements: fewer replacements allowed high CPU utilization with minimal system overhead. Inefficient matrix partitioning, by contrast, tended to stifle parallel performance because frequent cache misses created heavy bus traffic and thus increased system overhead.
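
The abstract does not specify the two memory allocation strategies, so the following Python sketch is only an illustration under stated assumptions, not the paper's method. It assumes a row-major matrix split among workers either in contiguous blocks of rows or round-robin (interleaved), and it counts how many cache lines end up touched by more than one worker, a rough proxy for the bus traffic the abstract attributes to poor partitioning. All function names, the row size, and the cache line size are hypothetical.

```python
def contiguous_partition(n_rows, n_workers):
    """Give each worker one contiguous block of rows (assumed strategy)."""
    base, extra = divmod(n_rows, n_workers)
    parts, start = [], 0
    for w in range(n_workers):
        # The first `extra` workers take one additional row each.
        size = base + (1 if w < extra else 0)
        parts.append(range(start, start + size))
        start += size
    return parts


def interleaved_partition(n_rows, n_workers):
    """Deal rows out round-robin: worker w gets rows w, w + P, w + 2P, ..."""
    return [range(w, n_rows, n_workers) for w in range(n_workers)]


def shared_cache_lines(parts, row_bytes, line_bytes):
    """Count cache lines touched by more than one worker.

    Assumes the matrix is stored row-major starting at a line boundary.
    """
    owners = {}
    for w, rows in enumerate(parts):
        for r in rows:
            first = (r * row_bytes) // line_bytes
            last = ((r + 1) * row_bytes - 1) // line_bytes
            for line in range(first, last + 1):
                owners.setdefault(line, set()).add(w)
    return sum(1 for ws in owners.values() if len(ws) > 1)


if __name__ == "__main__":
    # Hypothetical sizes: 64 rows of 16 bytes, 4 workers, 64-byte cache lines.
    for name, split in (("contiguous", contiguous_partition),
                        ("interleaved", interleaved_partition)):
        parts = split(64, 4)
        print(name, shared_cache_lines(parts, row_bytes=16, line_bytes=64))
```

With these hypothetical sizes, each worker's contiguous block covers whole cache lines, so no line is shared, whereas the interleaved split places rows from all four workers in every line. Real behavior depends on the machine's cache organization, which the abstract does not describe.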