Scalable Hardware-Algorithms for Binary Prefix Sums

  • Authors:
  • R. Lin; K. Nakano; S. Olariu; M. C. Pinotti; J. L. Schwing; A. Y. Zomaya

  • Affiliations:
  • State Univ. of New York Geneseo, Geneseo; Nagoya Institute of Technology, Nagoya, Japan; Old Dominion Univ., Norfolk, VA; I.E.I., C.N.R., Pisa, Italy; Central Washington Univ., Ellensburg; Univ. of West Australia, Perth, Western Australia, Australia

  • Venue:
  • IEEE Transactions on Parallel and Distributed Systems
  • Year:
  • 2000

Abstract

In this work, we address the problem of designing efficient and scalable hardware-algorithms for computing the sum and prefix sums of a $w^k$-bit sequence ($k\geq 2$), using as basic building blocks linear arrays of at most $w^2$ shift switches, where $w$ is a small power of $2$. An immediate consequence of this feature is that, in our designs, broadcasts are limited to buses of length at most $w^2$. We adopt a VLSI delay model where the "length" of a bus is proportional to the number of devices on the bus. We begin by discussing a hardware-algorithm that computes the sum of a $w^k$-bit binary sequence in the time of $2k-2$ broadcasts, while the corresponding prefix sums can be computed in the time of $3k-4$ broadcasts. Quite remarkably, even though our hardware-algorithm uses only linear arrays of size at most $w^2$, the total number of broadcasts involved is less than three times the number required by an "ideal" design. We then propose a second hardware-algorithm, operating in pipelined fashion, that computes the sum of a $kw^k$-bit binary sequence in the time of $3k+\lceil\log_w k\rceil -3$ broadcasts. Using this design, the corresponding prefix sums can be computed in the time of $4k+\lceil\log_w k\rceil -5$ broadcasts.
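
To make the quantities in the abstract concrete, the following minimal Python sketch shows what the binary prefix sums of a 0/1 sequence are and evaluates the four broadcast counts quoted above for given $w$ and $k$. It is a software illustration only, not the shift-switch design itself. The function names (`binary_prefix_sums`, `ceil_log`, `broadcast_counts`) and the dictionary keys are hypothetical, and applying the $k\geq 2$ restriction to the pipelined design as well is an assumption.

```python
from itertools import accumulate


def binary_prefix_sums(bits):
    """Return [b0, b0+b1, b0+b1+b2, ...] for a sequence of 0/1 values."""
    return list(accumulate(bits))


def ceil_log(base, n):
    """Smallest integer e >= 0 with base**e >= n (integer arithmetic, no float error)."""
    e, p = 0, 1
    while p < n:
        p *= base
        e += 1
    return e


def broadcast_counts(w, k):
    """Evaluate the broadcast counts stated in the abstract (hypothetical helper).

    Assumes k >= 2 for all four formulas; the abstract states this
    restriction explicitly only for the w^k-bit design.
    """
    if k < 2:
        raise ValueError("the formulas are stated for k >= 2")
    c = ceil_log(w, k)  # ceil(log_w k)
    return {
        "sum_wk_bits": 2 * k - 2,         # sum of a w^k-bit sequence
        "prefix_wk_bits": 3 * k - 4,      # prefix sums of a w^k-bit sequence
        "sum_kwk_bits": 3 * k + c - 3,    # pipelined sum of a k*w^k-bit sequence
        "prefix_kwk_bits": 4 * k + c - 5  # pipelined prefix sums, k*w^k bits
    }


if __name__ == "__main__":
    print(binary_prefix_sums([1, 0, 1, 1]))
    print(broadcast_counts(w=4, k=3))
```

For example, with $w=4$ and $k=3$ the input has $w^k=64$ bits, and the formulas give $2k-2=4$ broadcasts for the sum and $3k-4=5$ for the prefix sums.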