The most efficient way to parallelize a computation is to build and evaluate its task graph, constrained only by the data dependencies between tasks. Both Intel's C++ Concurrent Collections (CnC) and Threading Building Blocks (TBB) libraries support such task-based parallel programming. CnC also adopts the macro data-flow model by providing only single-assignment data objects in its global data space. Although CnC makes parallel programming easier by specifying data-flow dependencies solely through single-assignment data objects, its macro data-flow model incurs overhead. Because Intel's C++ CnC library is implemented on top of its C++ TBB library, we can measure the overhead of CnC by comparing its performance with that of TBB. In this paper, we analyze, for the first time, all three types of data dependencies in the tiled in-place Gauss–Jordan elimination algorithm. We implement the task-based parallel tiled Gauss–Jordan algorithm in TBB using the analyzed data dependencies and compare its performance with that of the CnC implementation. We find that the overhead of CnC over TBB is only 12–15% of the TBB time, and that CnC can deliver as much as 87–89% of the TBB performance for Gauss–Jordan elimination when the optimal tile size is used. Copyright © 2012 John Wiley & Sons, Ltd.