Designing a Scalable Processor Array for Recurrent Computations

Authors:
Kumar N. Ganapathy; Benjamin W. Wah; Chien-Wei Li
Venue:
IEEE Transactions on Parallel and Distributed Systems
Year:
1997

Abstract

In this paper, we study the design of a coprocessor (CoP) that efficiently executes recursive algorithms with uniform dependencies. Our design is based on two objectives: 1) fixed bandwidth to main memory (MM) and 2) scalability to higher performance without increasing MM bandwidth. Our CoP has an access unit (AU) organized as multiple queues, a processor array (PA) with regularly connected processing elements (PEs), and input/output networks for data routing. Our design is unique because it addresses the input/output bottleneck and scalability, two of the most important issues in integrating processor arrays into current systems. To be widely usable, processor arrays must scale to high performance with little or no impact on the supporting memory system. The use of multiple queues in the AU also eliminates the need for explicit data addresses, thereby simplifying the design of the control program. We present a mapping algorithm that partitions the data dependence graph (DG) of an application into regular blocks, sequences the blocks through the AU, and schedules the execution of the blocks, one at a time, on the PA. We show that our mapping procedure minimizes the amount of communication between blocks in the partitioned DG, and sequences the blocks through the AU to reduce the communication between the AU and MM. Using the matrix-product and transitive-closure applications, we study design trade-offs involving 1) the division of a fixed chip area between the PA and the AU, and 2) improvements in speedup with respect to increases in chip area. Our results show, for a fixed chip area, 1) that there is little degradation in throughput when using a linear PA as compared to a PA organized as a square mesh, and 2) that the design is not sensitive to the division of chip area between the PA and the AU. We further show that, for a fixed throughput, there is an inverse square root relationship between speedup and total chip area. Our study demonstrates the feasibility of a low-cost, memory bandwidth-limited, and scalable coprocessor system for evaluating recurrent algorithms with uniform dependencies.
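
To make the block-partitioning idea concrete, the following is a minimal sketch for the matrix-product case. It is not the mapping algorithm of the paper: it simply tiles the iteration space into b-by-b blocks, runs one block at a time, and counts the words that would cross a simulated main-memory interface. The function name `blocked_matmul`, the counter `words_moved`, and the choice to keep each output block in local storage until it is complete are all assumptions made for illustration.

```python
def blocked_matmul(A, B, b):
    """Multiply square matrices A and B block by block.

    Returns (C, words_moved), where words_moved counts the words exchanged
    with a simulated main memory. Hypothetical model, for illustration only.
    """
    n = len(A)
    C = [[0] * n for _ in range(n)]
    words_moved = 0
    for i0 in range(0, n, b):
        for j0 in range(0, n, b):
            i1, j1 = min(i0 + b, n), min(j0 + b, n)
            # Assumption: the output block stays in local storage
            # for the whole sweep over k0.
            acc = [[0] * (j1 - j0) for _ in range(i1 - i0)]
            for k0 in range(0, n, b):
                k1 = min(k0 + b, n)
                # Fetch one block of A and one block of B from main memory.
                a_blk = [row[k0:k1] for row in A[i0:i1]]
                b_blk = [row[j0:j1] for row in B[k0:k1]]
                words_moved += (i1 - i0) * (k1 - k0) + (k1 - k0) * (j1 - j0)
                # Execute one block of the computation.
                for i in range(i1 - i0):
                    for j in range(j1 - j0):
                        s = acc[i][j]
                        for k in range(k1 - k0):
                            s += a_blk[i][k] * b_blk[k][j]
                        acc[i][j] = s
            # Write the finished output block back once.
            for i in range(i1 - i0):
                C[i0 + i][j0:j1] = acc[i]
            words_moved += (i1 - i0) * (j1 - j0)
    return C, words_moved


if __name__ == "__main__":
    n = 4
    A = [[i * n + j for j in range(n)] for i in range(n)]
    I = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
    for b in (1, 2, 4):
        C, words = blocked_matmul(A, I, b)
        print(b, words, C == A)
```

Under this simple model, and when b divides n, the traffic is 2n^3/b + n^2 words, so larger blocks reduce main-memory traffic for the same amount of computation. That is the general effect a block-based mapping relies on; the paper's own cost model and block sequencing may differ.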