We highlight the trends that make hybrid multicore+GPU systems increasingly attractive for high-performance computing. We present a set of techniques for developing efficient dense linear algebra algorithms on these systems. We illustrate the main ideas with a hybrid LU factorization algorithm that splits the computation between a multicore processor and a graphics processor and uses specific techniques to reduce both the amount of pivoting and the communication between the hybrid components. The result is an efficient algorithm that makes balanced use of the multicore processor and the graphics processor.
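The split described above can be illustrated with a minimal sketch of a right-looking blocked LU factorization. It rests on several assumptions that are not taken from the paper: the function names (`panel_factor`, `trailing_update`, `hybrid_lu`, `reconstruct`) are hypothetical, the "CPU" and "GPU" labels only mark which role each step would plausibly play in a hybrid setting, and everything here runs as plain Python on a single processor. The sketch also uses standard partial pivoting, not the reduced-pivoting techniques the abstract mentions.

```python
def panel_factor(A, k, nb, piv):
    """Role assumed for the CPU: factor the narrow panel of columns k..k+nb-1
    with partial pivoting. This step is latency-bound and hard to parallelize."""
    n = len(A)
    end = min(k + nb, n)
    for j in range(k, end):
        # Choose the largest-magnitude entry at or below the diagonal as pivot.
        p = max(range(j, n), key=lambda i: abs(A[i][j]))
        if A[p][j] == 0.0:
            raise ValueError("matrix is singular")
        if p != j:
            A[j], A[p] = A[p], A[j]          # swap entire rows
            piv[j], piv[p] = piv[p], piv[j]
        for i in range(j + 1, n):
            A[i][j] /= A[j][j]               # multiplier, stored in place as L
            for c in range(j + 1, end):      # update only the panel columns
                A[i][c] -= A[i][j] * A[j][c]


def trailing_update(A, k, nb):
    """Role assumed for the GPU: update everything to the right of the panel.
    This step is dominated by matrix-matrix work and parallelizes well."""
    n = len(A)
    end = min(k + nb, n)
    # U12 = L11^{-1} * A12 (forward substitution with unit lower-triangular L11).
    for j in range(k, end):
        for i in range(j + 1, end):
            for c in range(end, n):
                A[i][c] -= A[i][j] * A[j][c]
    # A22 = A22 - L21 * U12 (the matrix-matrix multiply).
    for i in range(end, n):
        for j in range(k, end):
            lij = A[i][j]
            for c in range(end, n):
                A[i][c] -= lij * A[j][c]


def hybrid_lu(A, nb=2):
    """Return (LU, piv): L and U packed in one matrix, plus the row permutation.
    Row i of the permuted matrix is row piv[i] of the input."""
    A = [list(map(float, row)) for row in A]  # work on a copy
    n = len(A)
    piv = list(range(n))
    for k in range(0, n, nb):
        panel_factor(A, k, nb, piv)   # small, sequential part
        trailing_update(A, k, nb)     # large, parallel part
    return A, piv


def reconstruct(LU):
    """Multiply the packed factors back together (L has an implicit unit diagonal)."""
    n = len(LU)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            s = 0.0
            for t in range(min(i, j) + 1):
                l = 1.0 if t == i else LU[i][t]
                s += l * LU[t][j]
            out[i][j] = s
    return out
```

Calling `hybrid_lu(A, nb)` and then `reconstruct` on the result should reproduce the rows of `A` in the order given by `piv`, up to rounding error. In a real hybrid implementation the two steps would run on different devices and could overlap, so that the panel for the next iteration is factored while the previous trailing update is still in progress; the sketch leaves that scheduling out.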