Closing the Gap: CPU and FPGA Trends in Sustainable Floating-Point BLAS Performance

Authors:
Keith D. Underwood; K. Scott Hemmert
Affiliations:
Sandia National Laboratories, Albuquerque, NM; Sandia National Laboratories, Albuquerque, NM
Venue:
FCCM '04 Proceedings of the 12th Annual IEEE Symposium on Field-Programmable Custom Computing Machines
Year:
2004

Citing 0
Cited 37

Sparse Matrix-Vector multiplication on FPGAs

Proceedings of the 2005 ACM/SIGDA 13th international symposium on Field-programmable gate arrays
64-bit floating-point FPGA matrix multiplication

Proceedings of the 2005 ACM/SIGDA 13th international symposium on Field-programmable gate arrays
Designing Scalable FPGA-Based Reduction Circuits Using Pipelined Floating-Point Cores

IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Workshop 3 - Volume 04
H-SIMD Machine: Configurable Parallel Computing for Matrix Multiplication

ICCD '05 Proceedings of the 2005 International Conference on Computer Design
High Performance Linear Algebra Operations on Reconfigurable Systems

SC '05 Proceedings of the 2005 ACM/IEEE conference on Supercomputing
Embedded floating-point units in FPGAs

Proceedings of the 2006 ACM/SIGDA 14th international symposium on Field programmable gate arrays
Scalable Hybrid Designs for Linear Algebra on Reconfigurable Computing Systems

ICPADS '06 Proceedings of the 12th International Conference on Parallel and Distributed Systems - Volume 1
Architectures and APIs: assessing requirements for delivering FPGA performance to applications

Proceedings of the 2006 ACM/IEEE conference on Supercomputing
Optimized high-order finite difference wave equations modeling on reconfigurable computing platform

Microprocessors & Microsystems
Automatic mapping of nested loops to FPGAs

Proceedings of the 12th ACM SIGPLAN symposium on Principles and practice of parallel programming
Using FPGA Devices to Accelerate Biomolecular Simulations

Computer
Scalable and Modular Algorithms for Floating-Point Matrix Multiplication on Reconfigurable Computing Systems

IEEE Transactions on Parallel and Distributed Systems
Examining the viability of FPGA supercomputing

EURASIP Journal on Embedded Systems
High-Performance Reduction Circuits Using Deeply Pipelined Operators on FPGAs

IEEE Transactions on Parallel and Distributed Systems
Novel hardware-based approaches for intrusion detection

ICCOM'05 Proceedings of the 9th WSEAS International Conference on Communications
Architectural modifications to enhance the floating-point performance of FPGAs

IEEE Transactions on Very Large Scale Integration (VLSI) Systems
Implementation of a double-precision multiplier accumulator with exception treatment to a dense matrix multiplier module in FPGA

Proceedings of the 21st annual symposium on Integrated circuits and system design
FPGA Acceleration of RankBoost in Web Search Engines

ACM Transactions on Reconfigurable Technology and Systems (TRETS)
Reconfigurable Computing: The Theory and Practice of FPGA-Based Computation

Reconfigurable Computing: The Theory and Practice of FPGA-Based Computation
From Silicon to Science: The Long Road to Production Reconfigurable Supercomputing

ACM Transactions on Reconfigurable Technology and Systems (TRETS)
Parallel implementation of Cholesky LLT-algorithm in FPGA-based processor

PPAM'07 Proceedings of the 7th international conference on Parallel processing and applied mathematics
Fast, Efficient Floating-Point Adders and Multipliers for FPGAs

ACM Transactions on Reconfigurable Technology and Systems (TRETS)
Characterization of Fixed and Reconfigurable Multi-Core Devices for Application Acceleration

ACM Transactions on Reconfigurable Technology and Systems (TRETS)
FPGA-Array with Bandwidth-Reduction Mechanism for Scalable and Power-Efficient Numerical Simulations Based on Finite Difference Methods

ACM Transactions on Reconfigurable Technology and Systems (TRETS)
Layout aware optimization of high speed fixed coefficient FIR filters for FPGAs

International Journal of Reconfigurable Computing
Parallel FPGA-based all-pairs shortest-paths in a directed graph

IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
Exploration of heterogeneous FPGA architectures

International Journal of Reconfigurable Computing - Special issue on selected papers from the international workshop on reconfigurable communication-centric systems on chips (ReCoSoC' 2010)
Accelerating floating-point fitness functions in evolutionary algorithms: a FPGA-CPU-GPU performance comparison

Genetic Programming and Evolvable Machines
FPGA implementation of the conjugate gradient method

PPAM'05 Proceedings of the 6th international conference on Parallel Processing and Applied Mathematics
An FPGA-Based parallel accelerator for matrix multiplications in the newton-raphson method

EUC'05 Proceedings of the 2005 international conference on Embedded and Ubiquitous Computing
High-Speed Reconfigurable Parallel System to Design Good Error Correcting Codes in Communications

Journal of Signal Processing Systems
A fused hybrid floating-point and fixed-point dot-product for FPGAs

ARC'10 Proceedings of the 6th international conference on Reconfigurable Computing: architectures, Tools and Applications
A performance and energy comparison of FPGAs, GPUs, and multicores for sliding-window applications

Proceedings of the ACM/SIGDA international symposium on Field Programmable Gate Arrays
Domain-Specific language and compiler for stencil computation on FPGA-Based systolic computational-memory array

ARC'12 Proceedings of the 8th international conference on Reconfigurable Computing: architectures, tools and applications
Enabling fast ASIP design space exploration: an FPGA-based runtime reconfigurable prototyper

VLSI Design
A performance and energy comparison of convolution on GPUs, FPGAs, and multicore processors

ACM Transactions on Architecture and Code Optimization (TACO) - Special Issue on High-Performance Embedded Architectures and Compilers
A high-performance, low-energy FPGA accelerator for correntropy-based feature tracking (abstract only)

Proceedings of the ACM/SIGDA international symposium on Field programmable gate arrays

Abstract

Field programmable gate arrays (FPGAs) have long been an attractive alternative to microprocessors for computing tasks, as long as floating-point arithmetic is not required. Fueled by the advance of Moore's Law, FPGAs are rapidly reaching densities sufficient to enhance peak floating-point performance as well. The question, however, is how much of this peak performance can be sustained. This paper examines three Basic Linear Algebra Subprograms (BLAS) functions: vector dot product, matrix-vector multiply, and matrix multiply. For each operation, it compares microprocessors, FPGAs, and reconfigurable computing platforms. The analysis highlights the amount of memory bandwidth and internal storage needed to sustain peak performance with FPGAs. It considers the historical context of the last six years and extrapolates it over the next six.
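The link between these three operations and memory bandwidth can be illustrated with a rough operation count. The sketch below is not taken from the paper: it assumes square n-by-n operands, counts one multiply and one add per element pair, and assumes ideal on-chip reuse so that each operand is read from memory once and each result is written once. The function names `blas_intensity` and `required_bandwidth` are hypothetical.

```python
def blas_intensity(n):
    """Floating-point operations per memory word for three BLAS kernels.

    Assumption (not from the paper): every input word is read once and every
    output word is written once, which requires enough internal storage to
    hold the reused operands.
    """
    # Dot product: 2n flops; reads x and y (2n words), writes one scalar.
    dot = (2 * n) / (2 * n + 1)
    # Matrix-vector multiply: 2n^2 flops; reads A (n^2) and x (n), writes y (n).
    gemv = (2 * n * n) / (n * n + 2 * n)
    # Matrix multiply: 2n^3 flops; reads A and B (2n^2), writes C (n^2).
    gemm = (2 * n ** 3) / (3 * n * n)
    return {"dot": dot, "gemv": gemv, "gemm": gemm}


def required_bandwidth(peak_flops, intensity, bytes_per_word=8):
    """Bytes per second needed to sustain peak_flops at a given intensity.

    bytes_per_word=8 assumes double-precision operands.
    """
    return peak_flops / intensity * bytes_per_word


if __name__ == "__main__":
    for name, value in blas_intensity(1000).items():
        print(f"{name}: {value:.3f} flops per word")
```

Under these assumptions, the dot product and matrix-vector multiply stay below one and two flops per word regardless of n, so their sustained rate is bounded by memory bandwidth, while matrix multiply grows as 2n/3 and can approach peak if internal storage is large enough to hold the reused data.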