A performance and energy comparison of convolution on GPUs, FPGAs, and multicore processors

Authors:
Jeremy Fowers;Greg Brown;John Wernsing;Greg Stitt
Affiliations:
University of Florida;University of Florida;University of Florida;University of Florida
Venue:
ACM Transactions on Architecture and Code Optimization (TACO) - Special Issue on High-Performance Embedded Architectures and Compilers
Year:
2013

Abstract

Recent architectural trends have focused on increased parallelism via multicore processors and increased heterogeneity via accelerator devices (e.g., graphics-processing units, field-programmable gate arrays). Although these architectures have significant performance and energy potential, application designers face many device-specific challenges when choosing an appropriate accelerator or when customizing an algorithm for an accelerator. To help address this problem, this article thoroughly evaluates convolution, one of the most common operations in digital-signal processing, on multicores, graphics-processing units, and field-programmable gate arrays. Whereas many previous application studies evaluate a single use case of an application, this article assists designers with design-space exploration across numerous use cases by analyzing the effects of different input sizes, different algorithms, and different devices, and by determining Pareto-optimal trade-offs between performance and energy.
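
As a rough illustration of what a Pareto-optimal trade-off between performance and energy means, the sketch below filters a list of candidate configurations down to those that no other candidate beats on both execution time and energy. It is a minimal example under stated assumptions, not the article's method: the `Config` type, the `dominates` and `pareto_front` helpers, the labels "A" through "D", and all numbers are hypothetical placeholders rather than measurements from the article.

```python
from typing import List, NamedTuple


class Config(NamedTuple):
    """One candidate design point (hypothetical type, for illustration only)."""
    name: str        # placeholder label, not a real device or algorithm
    time_s: float    # execution time in seconds (lower is better)
    energy_j: float  # energy in joules (lower is better)


def dominates(a: Config, b: Config) -> bool:
    """Return True if a is no worse than b on both metrics and strictly better on one."""
    no_worse = a.time_s <= b.time_s and a.energy_j <= b.energy_j
    strictly_better = a.time_s < b.time_s or a.energy_j < b.energy_j
    return no_worse and strictly_better


def pareto_front(configs: List[Config]) -> List[Config]:
    """Keep only the configurations that no other configuration dominates."""
    return [c for c in configs if not any(dominates(o, c) for o in configs)]


# Made-up placeholder values, NOT results from the article.
candidates = [
    Config("A", 1.0, 9.0),  # fastest, but uses the most energy
    Config("B", 2.0, 4.0),
    Config("C", 3.0, 5.0),  # slower than B and uses more energy, so dominated
    Config("D", 4.0, 1.0),  # slowest, but uses the least energy
]

front = pareto_front(candidates)
print([c.name for c in front])  # prints ['A', 'B', 'D']
```

In this toy data set, A, B, and D each represent a different balance of speed and energy, so none can be discarded without knowing the designer's priorities; C is never the best choice because B is better on both metrics.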