We introduce a 64-bit ANSI/IEEE Std 754-1985 floating-point hardware matrix multiplier optimized for FPGA implementation. We propose a general block matrix multiplication algorithm applicable to arbitrary matrix sizes. The algorithm can potentially reach optimum performance by exploiting the data locality and reuse inherent in general matrix multiplication while accounting for the limits of I/O bandwidth and local storage capacity. We implement a scalable linear array of processing elements (PEs) supporting the proposed algorithm in Xilinx Virtex II Pro technology. Synthesis results confirm a superior performance-area ratio compared to related recent works. Assuming the same FPGA chip, the same amount of local memory, and the same I/O bandwidth, our design outperforms related proposals by a factor of at least 1.7 and up to 18 while consuming the fewest reconfigurable resources. A total of 39 PEs can be integrated into the xc2vp125-7 FPGA, reaching, for example, 15.6 GFLOPS with 1600 KB of local memory and 400 MB/s of external memory bandwidth.
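
The abstract does not spell out the block algorithm, so the following is only a minimal software sketch of generic block matrix multiplication for arbitrary matrix sizes, not the paper's hardware schedule. The function name `blocked_matmul` and the `block` parameter are hypothetical; `block` stands in for the tile size that would fit in local storage, and edge tiles are simply clipped when the dimensions are not multiples of it.

```python
def blocked_matmul(A, B, block=2):
    """Compute C = A * B tile by tile (illustrative sketch, not the paper's design).

    A is n x k and B is k x m, given as lists of rows.
    `block` is a hypothetical tile size; dimensions need not be multiples of it.
    """
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    # Walk over tiles of C; each tile accumulates products of A and B tiles.
    for i0 in range(0, n, block):
        for j0 in range(0, m, block):
            for p0 in range(0, k, block):
                # Clip tile bounds so partial tiles at the edges are handled.
                for i in range(i0, min(i0 + block, n)):
                    for p in range(p0, min(p0 + block, k)):
                        a = A[i][p]  # reused across the whole row of the B tile
                        for j in range(j0, min(j0 + block, m)):
                            C[i][j] += a * B[p][j]
    return C


if __name__ == "__main__":
    A = [[1, 2, 3], [4, 5, 6]]
    B = [[7, 8], [9, 10], [11, 12]]
    print(blocked_matmul(A, B, block=2))
```

Each element of an `A` tile is read once and reused against a full row of the matching `B` tile, which is the kind of data reuse that lets a design trade local memory against external bandwidth. How the paper maps tiles onto its linear array of PEs is not described in the abstract and is not modeled here.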