Multicore-based vector coprocessor sharing for performance and energy gains

Authors:
Spiridon F. Beldianu; Sotirios G. Ziavras
Affiliations:
New Jersey Institute of Technology, Newark, NJ; New Jersey Institute of Technology, Newark, NJ
Venue:
ACM Transactions on Embedded Computing Systems (TECS) - Special issue on application-specific processors
Year:
2013

Abstract

For most applications that use a dedicated vector coprocessor, the coprocessor's resources are underutilized because sustained data parallelism is lacking, often as a result of vector-length variations in dynamic environments. Our work is motivated by: (a) the mandate for multicore designs to use on-chip resources efficiently for low power and high performance; (b) the omnipresence of vector operations in high-performance scientific and emerging embedded applications; (c) the frequent need to handle a variety of vector sizes; and (d) the diverse computation needs of vector kernels in application suites. We present a robust design framework for vector coprocessor sharing in multicore environments that maximizes vector unit utilization and performance at substantially reduced energy costs. For our adaptive vector unit, which is attached to multiple cores, we propose three basic sharing policies that enforce coarse-grain, fine-grain, and vector-lane sharing. We benchmark these policies on a dual-core system and evaluate them in terms of floating-point performance, resource utilization, and power/energy consumption. Benchmarking with FIR filtering, FFT, matrix multiplication, and LU factorization shows that the policies yield high utilization and performance at low energy cost. The proposed policies provide speedups of 1.2 to 2 and reduce energy needs by about 50% compared to a system with a single core and an attached vector coprocessor. With performance expressed in clock cycles, the sharing policies demonstrate speedups of 3.62 to 7.92 over optimized Xeon runs. We also introduce performance and empirical power models that the runtime system can use to estimate the effectiveness of each policy in a hybrid system that can simultaneously implement this suite of shared coprocessor policies.