Design, implementation, and evaluation of a low-complexity vector-core for executing scalar/vector instructions

  • Authors: Mostafa I. Soliman
  • Affiliations: -
  • Venue: Journal of Parallel and Distributed Computing
  • Year: 2013

Abstract

This paper proposes a low-complexity vector core, called LcVc, that executes both scalar and vector instructions on the same execution datapath. A unified register file in the decode stage stores both scalar operands and vector elements. The execution stage accepts a new set of operands and produces a new result each cycle. Rather than issuing a vector instruction (a 1-D operation) as a whole, each of its element operations is issued sequentially using the existing scalar issue hardware. In the first implementation of LcVc, all register loads and stores go through the data cache in the memory access stage at a rate of one element per clock cycle. The complete design of the proposed LcVc processor is implemented in VHDL, targeting the Xilinx Spartan 3E FPGA (device xc3s1600e-4-fg320). LcVc requires 1778 slices in total, with 538 slice flip-flops and 3706 4-input LUTs: 1914 for logic and 1792 for RAMs. Moreover, our performance evaluation shows that the speedups of vector addition, vector scaling, SAXPY, and matrix-matrix multiplication on LcVc over scalar execution are 2.3, 2.5, 1.9, and 3, respectively. The additional hardware required to support the vector capability is insignificant (5%), which reduces the area per core and increases the number of cores available in a given chip area.
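The issue scheme described in the abstract, in which a vector instruction is expanded into element operations that pass one per cycle through the scalar datapath, can be illustrated with a small software model. This is only a sketch under stated assumptions, not the paper's VHDL design: the 32-entry register file, the convention that a vector occupies consecutive registers starting at a base index, and the names `expand_vector_op` and `execute` are all hypothetical, and the cycle count ignores pipeline fill, memory accesses, and hazards.

```python
# Toy model (assumed, not from the paper): a vector instruction is expanded
# into one scalar element operation per cycle over a unified register file.

def expand_vector_op(op, dst, src1, src2, vlen):
    """Yield one scalar element operation per vector element.

    dst, src1, src2 are base indices into the unified register file;
    element i of each vector is assumed to live at base + i.
    """
    for i in range(vlen):
        yield (op, dst + i, src1 + i, src2 + i)


def execute(regs, element_ops):
    """Apply each element operation in order and return the number issued.

    One operation is counted per cycle, mirroring an execution stage that
    accepts one new set of operands each cycle.
    """
    alu = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}
    cycles = 0
    for op, d, s1, s2 in element_ops:
        regs[d] = alu[op](regs[s1], regs[s2])
        cycles += 1
    return cycles


# Unified register file holding both scalars and vector elements (size assumed).
regs = [0] * 32
regs[0:4] = [1, 2, 3, 4]      # vector A in registers 0..3
regs[4:8] = [10, 20, 30, 40]  # vector B in registers 4..7

# Vector addition C = A + B, with C placed in registers 8..11.
cycles = execute(regs, expand_vector_op("add", dst=8, src1=0, src2=4, vlen=4))
print(regs[8:12], cycles)  # [11, 22, 33, 44] 4
```

In this model a vector operation needs no separate vector issue logic: the same loop that would issue a scalar instruction simply steps the register indices, which is the intuition behind the small hardware overhead reported above.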