Design, implementation, and evaluation of a low-complexity vector-core for executing scalar/vector instructions

Authors:
Mostafa I. Soliman
Venue:
Journal of Parallel and Distributed Computing
Year:
2013

Abstract

This paper proposes a low-complexity vector-core, called LcVc, that executes both scalar and vector instructions on the same execution datapath. A unified register file in the decode stage stores both scalar operands and vector elements. The execution stage accepts a new set of operands and produces a new result every clock cycle. Rather than issuing a vector instruction (a 1-D operation) as a whole, LcVc issues its individual element operations sequentially using the existing scalar issue hardware. In the first implementation of LcVc, all register loads and stores access the data cache in the memory access stage at a rate of one element per clock cycle. The complete LcVc processor is implemented in VHDL targeting the Xilinx Spartan-3E FPGA (xc3s1600e-4-fg320 device). The implementation requires 1778 slices in total, with 538 slice flip-flops and 3706 4-input LUTs (1914 for logic and 1792 for RAMs). Performance evaluation shows that LcVc achieves speedups of 2.3, 2.5, 1.9, and 3 over scalar execution for vector addition, vector scaling, SAXPY, and matrix-matrix multiplication, respectively. The additional hardware needed to support vector execution is small (5%), which reduces the area per core and so increases the number of cores that fit in a given chip area.
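The issue scheme summarized above, where a single register file holds both scalars and vector elements and each vector instruction is expanded into element operations that flow through the scalar issue path, can be illustrated with a toy functional model. This is a minimal sketch under stated assumptions, not LcVc's actual VHDL design: the register-file size, the class and method names, and the convention that a vector occupies consecutive register indices are all hypothetical. `issue_slots` simply counts one element operation (or one element load/store) per cycle; it does not model pipeline hazards, cache misses, or the paper's measured speedups.

```python
import operator
from dataclasses import dataclass, field

# Hypothetical size of the unified register file (not taken from the paper).
REG_FILE_SIZE = 64


@dataclass
class ToyVectorCore:
    """Toy model: one register file holds scalars and vector elements alike.

    A vector is assumed (hypothetically) to occupy `vl` consecutive
    registers starting at a base index.
    """
    regs: list = field(default_factory=lambda: [0.0] * REG_FILE_SIZE)
    issue_slots: int = 0  # one element operation issued per slot (cycle)

    def scalar_op(self, op, rd, rs, rt):
        # The shared datapath: one set of operands in, one result out per slot.
        self.regs[rd] = op(self.regs[rs], self.regs[rt])
        self.issue_slots += 1

    def vector_op(self, op, vd, vs, vt, vl):
        # A vector instruction is not issued as a whole; each element
        # operation reuses the scalar issue path, one per slot.
        for i in range(vl):
            self.scalar_op(op, vd + i, vs + i, vt + i)

    def vector_scalar_op(self, op, vd, rs, vt, vl):
        # Mixed scalar/vector operation (e.g. vector scaling): the scalar
        # operand regs[rs] lives in the same register file as the elements.
        for i in range(vl):
            self.scalar_op(op, vd + i, rs, vt + i)

    def vector_load(self, vd, memory, base, vl):
        # Loads move one element per slot from (modeled) data memory.
        for i in range(vl):
            self.regs[vd + i] = memory[base + i]
            self.issue_slots += 1

    def vector_store(self, vs, memory, base, vl):
        # Stores likewise move one element per slot.
        for i in range(vl):
            memory[base + i] = self.regs[vs + i]
            self.issue_slots += 1


# Usage: c = a + b on 8 elements, then a scaled copy 2 * a.
memory = [float(x) for x in range(24)]  # a at 0..7, b at 8..15, c at 16..23
core = ToyVectorCore()
core.vector_load(0, memory, 0, 8)           # regs[0..7]   <- a
core.vector_load(8, memory, 8, 8)           # regs[8..15]  <- b
core.vector_op(operator.add, 16, 0, 8, 8)   # regs[16..23] <- a + b
core.vector_store(16, memory, 16, 8)        # c <- regs[16..23]
core.regs[24] = 2.0                         # scalar preloaded for illustration
core.vector_scalar_op(operator.mul, 25, 24, 0, 8)  # regs[25..32] <- 2 * a
```

The design choice being illustrated is that no separate vector issue logic or vector register file is needed: the vector instruction is just a loop over the existing scalar path, which is consistent with the small hardware overhead the abstract reports.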