State-of-the-art dense linear algebra software, such as the LAPACK and ScaLAPACK libraries, suffers performance losses on multicore processors because it cannot fully exploit thread-level parallelism. At the same time, the coarse-grain dataflow model is gaining popularity as a paradigm for programming multicore architectures. This work examines implementations of classic dense linear algebra workloads, the Cholesky factorization, the QR factorization, and the LU factorization, using dynamic data-driven execution. Two emerging approaches to implementing coarse-grain dataflow are examined: the model of nested parallelism, represented by the Cilk framework, and the model of parallelism expressed through an arbitrary Directed Acyclic Graph, represented by the SMP Superscalar framework. Performance and coding effort are analyzed and compared against code manually parallelized at the thread level. Copyright © 2009 John Wiley & Sons, Ltd.
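To make the arbitrary-DAG model concrete, the sketch below enumerates the tasks of a tile Cholesky factorization (POTRF, TRSM, SYRK, GEMM kernels) on a t × t tile grid, derives dependencies from the tiles each task reads and writes, in the superscalar style of SMP Superscalar's dependence analysis, and executes the tasks in a data-driven order. This is an illustrative reconstruction under stated assumptions, not the paper's code; all function names are hypothetical.

```python
# Hypothetical sketch: the task DAG of a tile Cholesky factorization,
# executed in data-driven order. Tasks are named by kernel and tile
# indices; dependencies are inferred from read/write sets, mimicking
# the runtime dependence analysis of an SMP Superscalar-like system.

def cholesky_task_dag(t):
    """Enumerate tile Cholesky tasks on a t x t tile grid.
    Maps task name -> (tiles read, tiles written), in program order."""
    tasks = {}
    for k in range(t):
        tasks[("POTRF", k)] = ({(k, k)}, {(k, k)})
        for i in range(k + 1, t):
            tasks[("TRSM", i, k)] = ({(k, k), (i, k)}, {(i, k)})
        for i in range(k + 1, t):
            tasks[("SYRK", i, k)] = ({(i, k), (i, i)}, {(i, i)})
            for j in range(k + 1, i):
                tasks[("GEMM", i, j, k)] = (
                    {(i, k), (j, k), (i, j)}, {(i, j)})
    return tasks

def build_edges(tasks):
    """Superscalar-style dependence analysis: each task depends on the
    most recent earlier task that wrote a tile it accesses."""
    last_writer = {}
    edges = {name: set() for name in tasks}
    for name, (reads, writes) in tasks.items():  # program order
        for tile in reads | writes:
            if tile in last_writer:
                edges[name].add(last_writer[tile])
        for tile in writes:
            last_writer[tile] = name
    return edges

def data_driven_schedule(edges):
    """Run tasks as soon as their predecessors finish (Kahn's algorithm);
    a real runtime would dispatch each ready set to worker threads."""
    done, schedule = set(), []
    pending = dict(edges)
    while pending:
        ready = [n for n, preds in pending.items() if preds <= done]
        assert ready, "cycle in DAG"
        for n in ready:
            schedule.append(n)
            done.add(n)
            del pending[n]
    return schedule
```

Each pass of the scheduling loop corresponds to one wavefront of independent tasks, which is exactly the parallelism a DAG-based runtime exploits; the nested-parallelism (Cilk) model would instead recover a subset of this DAG through recursive spawn/sync structure.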