Applied numerical linear algebra
Applied numerical linear algebra
Asynchronous Iterative Methods for Multiprocessors
Journal of the ACM (JACM)
Journal of Computational and Applied Mathematics - Special issue on numerical analysis 2000 Vol. III: linear algebra
Sparse matrix solvers on the GPU: conjugate gradients and multigrid
ACM SIGGRAPH 2003 Papers
Parallel discrete event simulation with application to continuous systems
Parallel discrete event simulation with application to continuous systems
Journal of Computational Physics
Supermatrix out-of-order scheduling of matrix operations for SMP and multi-core architectures
Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures
Optimization of sparse matrix-vector multiplication on emerging multicore platforms
Proceedings of the 2007 ACM/IEEE conference on Supercomputing
Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures
Proceedings of the 2008 ACM/IEEE conference on Supercomputing
Benchmarking GPUs to tune dense linear algebra
Proceedings of the 2008 ACM/IEEE conference on Supercomputing
Improving parallelism and locality with asynchronous algorithms
Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
Best-effort semantic document search on GPUs
Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units
Proceedings of the 24th ACM International Conference on Supercomputing
Maestro: data orchestration and tuning for OpenCL devices
Euro-Par'10 Proceedings of the 16th international Euro-Par conference on Parallel processing: Part II
International Journal of High Performance Computing Applications
Data layout transformation for stencil computations on short-vector SIMD architectures
CC'11/ETAPS'11 Proceedings of the 20th international conference on Compiler construction: part of the joint European conferences on theory and practice of software
A static task partitioning approach for heterogeneous systems using OpenCL
CC'11/ETAPS'11 Proceedings of the 20th international conference on Compiler construction: part of the joint European conferences on theory and practice of software
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
CUDA 2d stencil computations for the jacobi method
PARA'10 Proceedings of the 10th international conference on Applied Parallel and Scientific Computing - Volume Part I
Extendable pattern-oriented optimization directives
CGO '11 Proceedings of the 9th Annual IEEE/ACM International Symposium on Code Generation and Optimization
Parallelizing SOR for GPGPUs using alternate loop tiling
Parallel Computing
Compiler and runtime support for enabling reduction computations on heterogeneous systems
Concurrency and Computation: Practice & Experience
An optimized large-scale hybrid DGEMM design for CPUs and ATI GPUs
Proceedings of the 26th ACM international conference on Supercomputing
A compiler-assisted runtime-prefetching scheme for heterogeneous platforms
IWOMP'12 Proceedings of the 8th international conference on OpenMP in a Heterogeneous World
Accelerating the red/black SOR method using GPUs with CUDA
PPAM'11 Proceedings of the 9th international conference on Parallel Processing and Applied Mathematics - Volume Part I
Extendable pattern-oriented optimization directives
ACM Transactions on Architecture and Code Optimization (TACO)
Stencil computations on heterogeneous platforms for the Jacobi method: GPUs versus Cell BE
The Journal of Supercomputing
Accelerating MapReduce on a coupled CPU-GPU architecture
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
CAP: co-scheduling based on asymptotic profiling in CPU+GPU hybrid systems
Proceedings of the 2013 International Workshop on Programming Models and Applications for Multicores and Manycores
G-Charm: an adaptive runtime system for message-driven parallel applications on hybrid systems
Proceedings of the 27th international ACM conference on International conference on supercomputing
Glinda: a framework for accelerating imbalanced applications on heterogeneous platforms
Proceedings of the ACM International Conference on Computing Frontiers
EuroVis '13 Proceedings of the 15th Eurographics Conference on Visualization
Hi-index | 0.00 |
We describe heterogeneous multi-CPU and multi-GPU implementations of Jacobi's iterative method for the 2-D Poisson equation on a structured grid, in both single- and double-precision. Properly tuned, our best implementation achieves 98% of the empirical streaming GPU bandwidth (66% of peak) on a NVIDIA C1060, and 78% on a C870. Motivated to find a still faster implementation, we further consider "wildly asynchronous" implementations that can reduce or even eliminate the synchronization bottleneck between iterations. In these versions, which are based on chaotic relaxation (Chazan and Miranker, 1969), we simply remove or delay synchronization between iterations. By doing so, we trade-off more flops, via more iterations to converge, for a higher degree of asynchronous parallelism. Our wild implementations on a GPU can be 1.2-2.5x faster than our best synchronized GPU implementation while achieving the same accuracy. Looking forward, this result suggests research on similarly "fast-and-loose" algorithms in the coming era of increasingly massive concurrency and relatively high synchronization or communication costs.