An extended set of FORTRAN basic linear algebra subprograms
ACM Transactions on Mathematical Software (TOMS)
A set of level 3 basic linear algebra subprograms
ACM Transactions on Mathematical Software (TOMS)
Basic Linear Algebra Subprograms for Fortran Usage
ACM Transactions on Mathematical Software (TOMS)
The science of deriving dense linear algebra algorithms
ACM Transactions on Mathematical Software (TOMS)
Representing linear algebra algorithms in code: the FLAME application program interfaces
ACM Transactions on Mathematical Software (TOMS)
LU-GPU: Efficient Algorithms for Solving Dense Linear Systems on Graphics Hardware
SC '05 Proceedings of the 2005 ACM/IEEE conference on Supercomputing
Supermatrix out-of-order scheduling of matrix operations for SMP and multi-core architectures
Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures
Scheduling of QR Factorization Algorithms on SMP and Multi-Core Architectures
PDP '08 Proceedings of the 16th Euromicro Conference on Parallel, Distributed and Network-Based Processing (PDP 2008)
SuperMatrix: a multithreaded runtime scheduling system for algorithms-by-blocks
Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming
Anatomy of high-performance matrix multiplication
ACM Transactions on Mathematical Software (TOMS)
Solving dense linear systems on platforms with multiple hardware accelerators
Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming
Programming matrix algorithms-by-blocks for thread-level parallelism
ACM Transactions on Mathematical Software (TOMS)
Satisfying your dependencies with SuperMatrix
CLUSTER '07 Proceedings of the 2007 IEEE International Conference on Cluster Computing
Exploiting the capabilities of modern GPUs for dense matrix computations
Concurrency and Computation: Practice & Experience
StarPU: a unified platform for task scheduling on heterogeneous multicore architectures
Concurrency and Computation: Practice & Experience - Euro-Par 2009
Editorial: Special issue editorial: Accelerators for high-performance computing
Journal of Parallel and Distributed Computing
Unleashing the high-performance and low-power of multi-core DSPs for general-purpose HPC
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Design patterns for sparse-matrix computations on hybrid CPU/GPU platforms
Scientific Programming
Hi-index | 0.00 |
Parallel accelerators are playing an increasingly important role in scientific computing. However, it is perceived that their weakness nowadays is their reduced ''programmability'' in comparison with traditional general-purpose CPUs. For the domain of dense linear algebra, we demonstrate that this is not necessarily the case. We show how the libflame library carefully layers routines and abstracts details related to storage and computation, so that extending it to take advantage of multiple accelerators is achievable without introducing platform specific complexity into the library code base. We focus on the experience of the library developer as he develops a library routine for a new operation, reduction of a generalized Hermitian positive definite eigenvalue problem to a standard Hermitian form, and configures the library to target a multi-GPU platform. It becomes obvious that the library developer does not need to know about the parallelization or the details of the multi-accelerator platform. Excellent performance on a system with four NVIDIA Tesla C2050 GPUs is reported. This makes libflame the first library to be released that incorporates multi-GPU functionality for dense matrix computations, setting a new standard for performance.