Versatile design of shared vector coprocessors for multicores

Authors:
Spiridon F. Beldianu;Christopher Dahlberg;Timothy Steele;Sotirios G. Ziavras
Affiliations:
Electrical and Computer Engineering Department, New Jersey Institute of Technology, Newark, NJ 07102, USA;Electrical and Computer Engineering Department, New Jersey Institute of Technology, Newark, NJ 07102, USA;Electrical and Computer Engineering Department, New Jersey Institute of Technology, Newark, NJ 07102, USA;Electrical and Computer Engineering Department, New Jersey Institute of Technology, Newark, NJ 07102, USA
Venue:
Microprocessors & Microsystems
Year:
2012

References:

Vector vs. superscalar and VLIW architectures for embedded multimedia benchmarks. Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture.

Overcoming the limitations of conventional vector processors. Proceedings of the 30th annual international symposium on Computer architecture.

SODA: A Low-power Architecture For Software Radio. Proceedings of the 33rd annual international symposium on Computer Architecture.

On the Scalability of 1- and 2-Dimensional SIMD Extensions for Multimedia Applications. ISPASS '05 Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software, 2005.

Scalable Vector Processors for Embedded Systems. IEEE Micro.

VESPA: portable, scalable, and flexible FPGA-based vector processors. CASES '08 Proceedings of the 2008 international conference on Compilers, architectures and synthesis for embedded systems.

Low Power Methodology Manual: For System-on-Chip Design.

Vector Processing as a Soft Processor Accelerator. ACM Transactions on Reconfigurable Technology and Systems (TRETS).

Scalar Processing Overhead on SIMD-Only Architectures. ASAP '09 Proceedings of the 2009 20th IEEE International Conference on Application-specific Systems, Architectures and Processors.

Efficient multi-ported memories for FPGAs. Proceedings of the 18th annual ACM/SIGDA international symposium on Field programmable gate arrays.

Investigation of Factors Impacting Thread-Level Parallelism from Desktop, Multimedia and HPC Applications. FCST '09 Proceedings of the 2009 Fourth International Conference on Frontier of Computer Science and Technology.

AnySP: Anytime Anywhere Anyway Signal Processing. IEEE Micro.

On-chip Vector Coprocessor Sharing for Multicores. PDP '11 Proceedings of the 2011 19th International Euromicro Conference on Parallel, Distributed and Network-Based Processing.

Abstract

For a wide range of applications that use a vector coprocessor, its resources are not highly utilized because of a lack of sustained data parallelism, which often results from insufficient vector parallelism or vector-length variations in dynamic environments. The motivation for our work stems from (a) the omnipresence of vector operations in high-performance scientific and emerging embedded applications; (b) the mandate for multicore designs to make efficient use of on-chip resources for low power and high performance; (c) the frequent need to handle a variety of vector sizes; and (d) the diverse computation needs of vector kernels in application suites. Our objective is to provide a versatile design framework that facilitates vector coprocessor sharing among multiple cores in a manner that maximizes resource utilization while also yielding very high performance at reduced area and energy costs. We have previously proposed three basic shared vector coprocessor architectures, based on coarse-grain temporal, fine-grain temporal, and vector lane sharing, that were implemented in SystemVerilog [15]. This paper presents substantially improved versions of these architectures that are implemented in synthesized RTL for higher accuracy. We evaluate these vector coprocessor sharing policies for a dual-core system using floating-point performance, resource utilization and power consumption as metrics. Benchmarking for FIR filtering, FFT, matrix multiplication, LU decomposition and sparse matrix-vector multiplication shows that these coprocessor sharing policies yield high utilization, high performance and low energy per operation. The benchmarking involves various scenarios for each application, which differ in vector length and in the parallelism-oriented coding technique. Fine-grain temporal sharing most often provides the best performance among the three policies, followed by vector lane sharing and then coarse-grain temporal sharing. It is also shown that per-core exclusive access to the vector resources does not maximize their utilization.