Reconfigurable computing systems are a promising route to higher performance for numerical applications. We study the applicability of the Convey HC-1 to such applications by decomposing a preconditioned conjugate gradient (CG) method into several independent kernels that can operate concurrently. To overlap execution and minimize data transfers, we stream data between the kernel units through a central buffer set. A microprogrammable control unit orchestrates memory accesses, buffer reads and writes, and kernel execution, and allows further algorithms to be executed on the available kernel units. Solving the Poisson problem can thereby be accelerated by up to 10 times compared to a single-threaded software version on the HC-1 and by up to 1.2 times compared to a 2-socket hex-core Intel Xeon Westmere system with 24 hardware threads, for large problem sizes and using only a single application engine.
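
To make the kernel decomposition concrete, the following is a minimal software sketch of a preconditioned CG iteration, split into the kinds of kernels such a method typically consists of: a sparse matrix-vector product, dot products, vector updates, and a preconditioner application. It is an illustration only, not the hardware design described above. The 1D Poisson matrix tridiag(-1, 2, -1), the Jacobi (diagonal) preconditioner, and all function names are assumptions chosen for brevity; the abstract does not specify the preconditioner or the discretization.

```python
import math


def spmv_poisson_1d(x):
    """SpMV kernel: y = A x for the assumed 1D Poisson matrix tridiag(-1, 2, -1)."""
    n = len(x)
    y = [0.0] * n
    for i in range(n):
        left = x[i - 1] if i > 0 else 0.0
        right = x[i + 1] if i < n - 1 else 0.0
        y[i] = 2.0 * x[i] - left - right
    return y


def dot(u, v):
    """Dot-product kernel."""
    return sum(a * b for a, b in zip(u, v))


def axpy(alpha, x, y):
    """Vector-update kernel: returns alpha * x + y."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]


def jacobi_apply(r):
    """Preconditioner kernel (assumed Jacobi): z = D^{-1} r, where diag(A) = 2."""
    return [ri / 2.0 for ri in r]


def pcg(b, tol=1e-10, max_iter=1000):
    """Preconditioned CG for A x = b; returns (x, iterations used)."""
    x = [0.0] * len(b)
    r = list(b)                # residual r = b - A x, with x = 0
    z = jacobi_apply(r)
    p = list(z)                # initial search direction
    rz = dot(r, z)
    for k in range(max_iter):
        if math.sqrt(dot(r, r)) <= tol:
            return x, k
        q = spmv_poisson_1d(p)
        alpha = rz / dot(p, q)
        x = axpy(alpha, p, x)   # x <- x + alpha * p
        r = axpy(-alpha, q, r)  # r <- r - alpha * q
        z = jacobi_apply(r)
        rz_new = dot(r, z)
        p = axpy(rz_new / rz, p, z)  # p <- z + beta * p
        rz = rz_new
    return x, max_iter
```

Calling `pcg([1.0] * 8)` solves a small instance of this system. Within one iteration the updates of `x` and `r` do not depend on each other, which is the kind of independence a design with concurrently operating kernel units could exploit. In this sequential sketch the kernels simply run one after another.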