The most efficient way to parallelize a computation is to build and evaluate its task graph, constrained only by the data dependencies between tasks. Both Intel's C++ Concurrent Collections (CnC) and Threading Building Blocks (TBB) libraries support such task-based parallel programming. CnC also adopts the macro data-flow model by providing only single-assignment data objects in its global data space. Although CnC makes parallel programming easier by specifying data-flow dependencies solely through single-assignment data objects, its macro data-flow model incurs overhead. Because Intel's C++ CnC library is implemented on top of its C++ TBB library, we can measure the overhead of CnC by comparing its performance with that of TBB. In this paper, we analyze, for the first time, all three types of data dependencies in the tiled in-place Gauss–Jordan elimination algorithm. We implement the task-based parallel tiled Gauss–Jordan algorithm in TBB using the analyzed data dependencies and compare its performance with that of the CnC implementation. We find that the overhead of CnC over TBB is only 12–15% of the TBB time, and that CnC can deliver as much as 87–89% of the TBB performance for Gauss–Jordan elimination when the optimal tile size is used. Copyright © 2012 John Wiley & Sons, Ltd.