This paper describes the development of parallel algorithms on M31, a large-scale shared-memory VAX multiprocessor. Matrix operations were optimized for a subset of the BLAS (Basic Linear Algebra Subprograms). Efficient parallel image processing algorithms were also developed for convolution, correlation, and the Fast Fourier Transform (non-synchronizing one- and two-dimensional FFTs). The effect of matrix partitioning was examined using two different memory allocation strategies. We found that contiguous memory partitioning can yield speedups beyond the linear expectation. This superlinear performance was achieved with a parallel algorithm designed to minimize cache replacements: fewer replacements allowed high CPU utilization with minimal system overhead. Inefficient matrix partitioning, by contrast, tended to stifle parallel performance, because frequent cache misses created heavy bus traffic and thus increased system overhead.
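To make the partitioning idea concrete, the following is a minimal sketch in Python, not the paper's implementation. It assumes a row-major matrix split by rows, and it assumes that the alternative to contiguous partitioning is a round-robin (interleaved) assignment; the abstract does not name the second strategy. The function names `contiguous_partition`, `interleaved_partition`, and `address_span` are hypothetical, and the address span is only a rough proxy for cache locality.

```python
def contiguous_partition(n_rows, n_workers):
    """Give each worker one contiguous block of row indices."""
    base, extra = divmod(n_rows, n_workers)
    blocks, start = [], 0
    for w in range(n_workers):
        # The first `extra` workers take one additional row.
        size = base + (1 if w < extra else 0)
        blocks.append(list(range(start, start + size)))
        start += size
    return blocks


def interleaved_partition(n_rows, n_workers):
    """Deal rows out round-robin (assumed alternative strategy)."""
    return [list(range(w, n_rows, n_workers)) for w in range(n_workers)]


def address_span(rows, row_bytes):
    """Bytes from the first to the last row a worker touches (row-major)."""
    if not rows:
        return 0
    return (max(rows) - min(rows) + 1) * row_bytes


if __name__ == "__main__":
    row_bytes = 64  # hypothetical row size in bytes
    for name, split in (("contiguous", contiguous_partition),
                        ("interleaved", interleaved_partition)):
        parts = split(8, 3)
        spans = [address_span(p, row_bytes) for p in parts]
        print(name, parts, spans)
```

Under these assumptions, each worker in the contiguous scheme touches a compact address range, while in the interleaved scheme every worker's rows are spread across nearly the whole matrix, which is one plausible way a partitioning choice could raise cache replacements and bus traffic.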