Out-of-core implementations of algorithms for dense matrix computations have traditionally focused on optimal use of memory so as to minimize I/O, often trading programmability for performance. In this article we show how the current state of hardware and software allows the programmability problem to be addressed without sacrificing performance. This rests on two realizations: memory is now cheap and plentiful, making it less critical to optimally orchestrate I/O, and recent algorithms view matrices as collections of submatrices and computation as operations over those submatrices. Together these allow libraries to be coded at a high level of abstraction, leaving the scheduling of computation and data movement to a runtime system. This stands in sharp contrast to more traditional approaches, which pursue optimal use of in-core memory and explicit overlap of I/O with computation at the expense of considerable programming complexity. Performance of the approach is demonstrated on multicore architectures as well as on platforms equipped with hardware accelerators.
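To make the "matrices as collections of submatrices" idea concrete, the following is a minimal sketch of a Cholesky factorization coded as an algorithm-by-blocks: the algorithm only *generates* a list of tasks over submatrices (named after the BLAS/LAPACK kernels POTRF, TRSM, and GEMM they correspond to), and a separate stage executes them. This is an illustrative toy, not the article's library or runtime: a real runtime would analyze the dependencies among the tasks and schedule them, together with the associated I/O, possibly out of order; here they are simply run in program order.

```python
import math

def potrf(A):
    """Unblocked Cholesky factor of one small block (in place, lower triangular)."""
    n = len(A)
    for j in range(n):
        for k in range(j):
            A[j][j] -= A[j][k] * A[j][k]
        A[j][j] = math.sqrt(A[j][j])
        for i in range(j + 1, n):
            for k in range(j):
                A[i][j] -= A[i][k] * A[j][k]
            A[i][j] /= A[j][j]
    for i in range(n):                  # zero the strictly upper part
        for j in range(i + 1, n):
            A[i][j] = 0.0

def trsm(L, B):
    """Triangular solve B := B * inv(L)^T, one row of B at a time."""
    for r in range(len(B)):
        for j in range(len(L)):
            for k in range(j):
                B[r][j] -= B[r][k] * L[j][k]
            B[r][j] /= L[j][j]

def gemm_nt(C, A, B):
    """Rank-b update C := C - A * B^T (covers the symmetric SYRK case too)."""
    for i in range(len(C)):
        for j in range(len(C[0])):
            C[i][j] -= sum(A[i][k] * B[j][k] for k in range(len(A[0])))

def cholesky_by_blocks(blocks, nb):
    """Right-looking Cholesky expressed as a list of tasks over submatrices.

    `blocks` maps (i, j), i >= j, to the (i, j) submatrix of a symmetric
    positive definite matrix partitioned into nb x nb square blocks.
    """
    tasks = []
    for k in range(nb):
        tasks.append(("POTRF", (k, k)))
        for i in range(k + 1, nb):
            tasks.append(("TRSM", (i, k), (k, k)))
        for i in range(k + 1, nb):
            for j in range(k + 1, i + 1):
                tasks.append(("GEMM", (i, j), (i, k), (j, k)))
    # A runtime system would schedule these tasks (and the data movement
    # behind them) out of order; this sketch executes them sequentially.
    for t in tasks:
        if t[0] == "POTRF":
            potrf(blocks[t[1]])
        elif t[0] == "TRSM":
            trsm(blocks[t[2]], blocks[t[1]])
        else:
            gemm_nt(blocks[t[1]], blocks[t[2]], blocks[t[3]])
    return tasks

# Example: a 4x4 SPD matrix partitioned into 2x2 blocks of order 2;
# only the lower triangle of blocks is stored.
blocks = {(0, 0): [[4.0, 2.0], [2.0, 5.0]],
          (1, 0): [[0.0, 2.0], [0.0, 0.0]],
          (1, 1): [[5.0, 2.0], [2.0, 5.0]]}
task_list = cholesky_by_blocks(blocks, 2)
```

The separation between generating tasks and executing them is the point of the design: swapping the sequential loop for a dependency-aware scheduler changes nothing in the algorithm's code, which is what lets such libraries stay at a high level of abstraction.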