Loops that contain cross-processor data dependencies, known as DOACROSS loops, are common in scientific programs. Parallelizing such loops efficiently is important yet nontrivial. One useful parallelization technique for DOACROSS loops is pipelining, in which each processor (node) performs its computation in blocks and, after each block, sends data to the next node in the pipeline. The amount of computation performed before sending a message is called the block size; choosing it well is important for efficient execution, yet difficult to do statically. This paper describes a flexible runtime approach to choosing the block size. Rather than relying on a static estimate of the workload, our system takes measurements during the first two iterations of a program and uses the results to build an execution model and choose an appropriate block size, which, unlike a statically chosen one, may be nonuniform. To increase the accuracy of the chosen block size, our execution model takes both intra-node and inter-node performance into account. Notably, our system finds an effective block size automatically, without the experimentation that a statically chosen block size requires. Performance on a network of workstations shows that programs using our runtime analysis outperform those using static block sizes by as much as 18% when the workload is unbalanced. When the workload is balanced, competitive performance is achieved as long as the initial overhead is sufficiently amortized.
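To make the pipelining scheme concrete, the following is a minimal sketch (not the paper's implementation) of a pipelined DOACROSS loop with a tunable block size, written in C with MPI as an assumed message-passing layer. The recurrence a[i][j] = f(a[i-1][j], a[i][j-1]), the sizes N and M, the block size B, and the function f() are all hypothetical names chosen for illustration; rows are block-distributed across ranks, and each rank computes B columns at a time before forwarding its boundary row to the next rank in the pipeline.

/* Hypothetical sketch: pipelined execution of a DOACROSS loop with
 * block size B.  Each rank owns N rows of a 2D recurrence
 *     a[i][j] = f(a[i-1][j], a[i][j-1])
 * and receives the halo row from its predecessor one block of
 * columns at a time, so downstream ranks can start before upstream
 * ranks finish.  N, M, B, and f() are illustrative, not from the paper. */
#include <mpi.h>
#include <stdlib.h>

#define N 1024   /* rows owned by each rank */
#define M 4096   /* total columns           */

static double f(double up, double left) { return 0.5 * (up + left); }

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int B = (argc > 1) ? atoi(argv[1]) : 64;       /* block size: columns per message */
    double (*a)[M] = malloc((N + 1) * sizeof *a);  /* row 0 is the halo row */

    /* Row 0 holds data from the predecessor (or a fixed boundary
     * condition on rank 0); initialize everything to 1.0. */
    for (int i = 0; i <= N; i++)
        for (int j = 0; j < M; j++)
            a[i][j] = 1.0;

    for (int j0 = 0; j0 < M; j0 += B) {
        int w = (j0 + B < M) ? B : M - j0;         /* width of this block */

        /* Wait for halo columns [j0, j0+w) from the previous rank. */
        if (rank > 0)
            MPI_Recv(&a[0][j0], w, MPI_DOUBLE, rank - 1, j0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* Compute the block; each row depends on the row above, and
         * each column on the one to its left. */
        for (int i = 1; i <= N; i++)
            for (int j = j0; j < j0 + w; j++)
                a[i][j] = f(a[i - 1][j], j > 0 ? a[i][j - 1] : 1.0);

        /* Forward our last row so the next rank can start this block. */
        if (rank < size - 1)
            MPI_Send(&a[N][j0], w, MPI_DOUBLE, rank + 1, j0,
                     MPI_COMM_WORLD);
    }

    free(a);
    MPI_Finalize();
    return 0;
}

The trade-off the paper targets is visible here: a small B starts the downstream ranks sooner but pays per-message overhead more often, while a large B amortizes that overhead but lengthens pipeline fill and drain. A runtime system in the spirit of the paper would time the first iterations of the loop, feed those measurements into an execution model of per-block compute and communication cost, and then pick B (possibly a different B per block) instead of fixing it at compile time; the exact model and measurement mechanism are described in the paper rather than in this sketch.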