Fine-grain performance scaling of soft vector processors

  • Authors:
  • Peter Yiannacouras; J. Gregory Steffan; Jonathan Rose

  • Affiliations:
  • University of Toronto, Toronto, Canada; University of Toronto, Toronto, Canada; University of Toronto, Toronto, Canada

  • Venue:
  • CASES '09 Proceedings of the 2009 international conference on Compilers, architecture, and synthesis for embedded systems
  • Year:
  • 2009

Abstract

Embedded systems are often implemented on FPGA devices and 25% of the time include a soft processor--a processor built using the FPGA reprogrammable fabric. Because of their prevalence and flexibility, soft processors are compelling targets for customization--although current soft processors provide few architectural variations. Recent work has proposed augmenting soft processors with customizable vector processing support, enabling designers to easily scale performance by exploiting the data parallelism available in an application. However, this approach provides only coarse-grain scaling: each step doubles the number of vector datapaths and yields less than double the performance. In this work we further augment soft vector processors with more fine-grain architectural modifications: we add support for (i) vector chaining and (ii) heterogeneous vector lanes, allowing the soft vector processor to be customized not only to the data-level parallelism available in an application, but also to its functional unit demand. We evaluate area and wall-clock performance with full hardware implementations on state-of-the-art FPGAs and find that chaining can provide a 15-45% average performance improvement for less area than doubling the lanes, and that heterogeneous lanes can save 6-13% area with little or no performance loss in some cases. Finally, we implement 1200 soft vector processor variants and find that, when choosing the best variant per application, the peak performance per area relative to our base vector processor can be increased by an average of 13% and by up to 34%.
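
The per-application selection described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `Variant` type, the variant names, the benchmark labels, and all area, clock, and cycle figures are invented placeholders, not data from the paper. It also assumes that performance per area means the inverse of wall-clock time multiplied by area, which the abstract does not define.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    """One hypothetical soft vector processor configuration."""
    name: str
    area: float        # placeholder area units (not real FPGA numbers)
    clock_mhz: float   # placeholder clock frequency
    cycles: dict       # placeholder cycle counts, keyed by benchmark label


def perf_per_area(variant, bench):
    """Assumed metric: 1 / (wall-clock time * area)."""
    wall_time_us = variant.cycles[bench] / variant.clock_mhz
    return 1.0 / (wall_time_us * variant.area)


def best_variant(variants, bench):
    """Pick the variant with the highest performance per area for one benchmark."""
    return max(variants, key=lambda v: perf_per_area(v, bench))


# Invented placeholder values, chosen only to exercise the selection logic.
BASE = Variant("base", area=100.0, clock_mhz=100.0,
               cycles={"bench_a": 1000, "bench_b": 2000})
VARIANTS = [
    BASE,
    Variant("chained", area=110.0, clock_mhz=100.0,
            cycles={"bench_a": 700, "bench_b": 1900}),
    Variant("hetero_lanes", area=90.0, clock_mhz=100.0,
            cycles={"bench_a": 1000, "bench_b": 2000}),
]

for bench in ("bench_a", "bench_b"):
    winner = best_variant(VARIANTS, bench)
    gain = perf_per_area(winner, bench) / perf_per_area(BASE, bench)
    print(f"{bench}: {winner.name}, {gain:.2f}x base performance per area")
```

With these placeholder values, the benchmark that benefits from chaining selects the chained variant despite its extra area, while the benchmark that does not selects the smaller heterogeneous-lane variant. That is the kind of per-application trade-off the abstract reports across its 1200 variants.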