Scalable Hardware-Algorithms for Binary Prefix Sums

Authors:
R. Lin;K. Nakano;S. Olariu;M. C. Pinotti;J. L. Schwing;A. Y. Zomaya
Affiliations:
State Univ. of New York Geneseo, Geneseo;Nagoya Institute of Technology, Nagoya, Japan;Old Dominion Univ., Norfolk, VA;I.E.I., C.N.R., Pisa, Italy;Central Washington Univ., Ellensburg;Univ. of West Australia, Perth, Western Australia, Australia
Venue:
IEEE Transactions on Parallel and Distributed Systems
Year:
2000

Abstract

In this work, we address the problem of designing efficient and scalable hardware-algorithms for computing the sum and prefix sums of a $w^k$-bit sequence ($k \geq 2$), using as basic building blocks linear arrays of at most $w^2$ shift switches, where $w$ is a small power of $2$. An immediate consequence of this feature is that in our designs broadcasts are limited to buses of length at most $w^2$. We adopt a VLSI delay model where the "length" of a bus is proportional to the number of devices on the bus. We begin by discussing a hardware-algorithm that computes the sum of a $w^k$-bit binary sequence in the time of $2k-2$ broadcasts, while the corresponding prefix sums can be computed in the time of $3k-4$ broadcasts. Quite remarkably, even though our hardware-algorithm uses only linear arrays of size at most $w^2$, the total number of broadcasts involved is less than three times the number required by an "ideal" design. We then go on to propose a second hardware-algorithm, operating in pipelined fashion, that computes the sum of a $kw^k$-bit binary sequence in the time of $3k+\lceil\log_w k\rceil -3$ broadcasts. Using this design, the corresponding prefix sums can be computed in the time of $4k+\lceil\log_w k\rceil -5$ broadcasts.
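
To make the broadcast counts quoted in the abstract concrete, the following is a minimal Python sketch that simply evaluates the four formulas for given $w$ and $k$, together with a plain software reference for what "binary prefix sums" means. It does not model the shift-switch hardware or the authors' designs; the function names (`ceil_log`, `broadcasts_basic`, `broadcasts_pipelined`, `binary_prefix_sums`) are hypothetical and chosen only for illustration.

```python
from itertools import accumulate


def ceil_log(base, x):
    """Smallest integer e >= 0 with base**e >= x, using exact integer arithmetic."""
    e, power = 0, 1
    while power < x:
        power *= base
        e += 1
    return e


def broadcasts_basic(k):
    """(sum, prefix sums) broadcast counts for a w^k-bit input, per the abstract.

    The abstract states these formulas for k >= 2.
    """
    if k < 2:
        raise ValueError("the abstract assumes k >= 2")
    return 2 * k - 2, 3 * k - 4


def broadcasts_pipelined(w, k):
    """(sum, prefix sums) broadcast counts for a k*w^k-bit input, per the abstract."""
    if k < 2:
        raise ValueError("the abstract assumes k >= 2")
    extra = ceil_log(w, k)  # the ceil(log_w k) term
    return 3 * k + extra - 3, 4 * k + extra - 5


def binary_prefix_sums(bits):
    """Software reference (not the hardware design): running count of 1 bits."""
    return list(accumulate(bits))


if __name__ == "__main__":
    w, k = 4, 3  # illustrative values only; w is a small power of 2
    print("input bits (basic):", w ** k, "broadcasts:", broadcasts_basic(k))
    print("input bits (pipelined):", k * w ** k, "broadcasts:", broadcasts_pipelined(w, k))
    print(binary_prefix_sums([1, 0, 1, 1]))
```

The integer loop in `ceil_log` is used instead of `math.log` to avoid floating-point rounding when $k$ is an exact power of $w$.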