This paper throws a small "wet blanket" on the hot topic of GPGPU acceleration, based on experience analyzing and tuning both multithreaded CPU and GPU implementations of three computations in scientific computing. These computations, namely (a) iterative sparse linear solvers, (b) sparse Cholesky factorization, and (c) the fast multipole method, exhibit complex behavior and vary in computational intensity and memory-reference irregularity. In each case, algorithmic analysis and prior work might lead us to conclude that an idealized GPU should deliver better performance; however, given at least equal-effort CPU tuning and consideration of realistic workloads and calling contexts, we find that two modern quad-core CPU sockets can roughly match one or two GPUs in performance. These conclusions are not intended to dampen interest in GPU acceleration; on the contrary, they partially illuminate the boundary between CPU and GPU performance, and they ask architects to consider application contexts in the design of future coupled on-die CPU/GPU processors.
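To make concrete the kind of kernel at issue, here is a minimal sketch (not from the paper) of sparse matrix-vector multiply in CSR format, the inner kernel of the iterative sparse solvers mentioned above. The irregular gather from `x` is what makes the computation memory-bandwidth-bound and hard to accelerate on either CPUs or GPUs; the function and data layout are standard CSR, with illustrative variable names chosen here.

```python
def spmv_csr(values, col_idx, row_ptr, x):
    """Compute y = A @ x for a sparse matrix A stored in CSR form.

    values  -- nonzero entries of A, row by row
    col_idx -- column index of each nonzero
    row_ptr -- row_ptr[i]:row_ptr[i+1] delimits row i's nonzeros
    """
    n = len(row_ptr) - 1
    y = [0.0] * n
    for i in range(n):
        s = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            s += values[k] * x[col_idx[k]]  # irregular gather from x
        y[i] = s
    return y

# Example: the 3x3 matrix [[4, 0, 1], [0, 2, 0], [3, 0, 5]]
values  = [4.0, 1.0, 2.0, 3.0, 5.0]
col_idx = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]
print(spmv_csr(values, col_idx, row_ptr, [1.0, 1.0, 1.0]))  # [5.0, 2.0, 8.0]
```

Each nonzero contributes one multiply-add against two memory reads (the value and the gathered `x` entry), so arithmetic intensity is low regardless of the target processor; this is the behavior the paper's CPU-versus-GPU comparison probes.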