Loops that contain cross-processor data dependencies, known as DOACROSS loops, are common in scientific programs. Parallelizing such loops efficiently is important yet nontrivial. One useful parallelization technique for DOACROSS loops is pipelining, in which each processor (node) performs its computation in blocks and, after each block, sends data to the next node in the pipeline. The amount of computation performed before sending a message is called the block size; choosing it well is important for efficient execution, yet difficult to do statically. This paper describes a flexible runtime approach to choosing the block size. Rather than relying on a static estimate of the workload, our system takes measurements during the first two iterations of a program and uses the results to build an execution model and choose an appropriate block size, which, unlike a statically chosen one, may be nonuniform. To increase the accuracy of the chosen block size, our execution model takes both intra-node and inter-node performance into account. Notably, our system finds an effective block size automatically, without the experimentation that a statically chosen block size requires. Performance on a network of workstations shows that programs using our runtime analysis outperform those using static block sizes by as much as 18% when the workload is unbalanced. When the workload is balanced, competitive performance is achieved as long as the initial overhead is sufficiently amortized.
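To make the pipelining scheme concrete, the following is a minimal sketch (not the paper's implementation) of a pipelined DOACROSS loop with a tunable block size, written in C with MPI as an assumed message-passing layer. The recurrence a[i][j] = f(a[i-1][j], a[i][j-1]), the sizes N and M, the block size B, and the function f() are all hypothetical names chosen for illustration; rows are block-distributed across ranks, and each rank computes B columns at a time before forwarding its boundary row to the next rank in the pipeline.

/* Hypothetical sketch: pipelined execution of a DOACROSS loop with
 * block size B.  Each rank owns N rows of a 2D recurrence
 *     a[i][j] = f(a[i-1][j], a[i][j-1])
 * and receives the halo row from its predecessor one block of
 * columns at a time, so downstream ranks can start before upstream
 * ranks finish.  N, M, B, and f() are illustrative, not from the paper. */
#include <mpi.h>
#include <stdlib.h>

#define N 1024   /* rows owned by each rank */
#define M 4096   /* total columns           */

static double f(double up, double left) { return 0.5 * (up + left); }

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int B = (argc > 1) ? atoi(argv[1]) : 64;       /* block size: columns per message */
    double (*a)[M] = malloc((N + 1) * sizeof *a);  /* row 0 is the halo row */

    /* Row 0 holds data from the predecessor (or a fixed boundary
     * condition on rank 0); initialize everything to 1.0. */
    for (int i = 0; i <= N; i++)
        for (int j = 0; j < M; j++)
            a[i][j] = 1.0;

    for (int j0 = 0; j0 < M; j0 += B) {
        int w = (j0 + B < M) ? B : M - j0;         /* width of this block */

        /* Wait for halo columns [j0, j0+w) from the previous rank. */
        if (rank > 0)
            MPI_Recv(&a[0][j0], w, MPI_DOUBLE, rank - 1, j0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* Compute the block; each row depends on the row above, and
         * each column on the one to its left. */
        for (int i = 1; i <= N; i++)
            for (int j = j0; j < j0 + w; j++)
                a[i][j] = f(a[i - 1][j], j > 0 ? a[i][j - 1] : 1.0);

        /* Forward our last row so the next rank can start this block. */
        if (rank < size - 1)
            MPI_Send(&a[N][j0], w, MPI_DOUBLE, rank + 1, j0,
                     MPI_COMM_WORLD);
    }

    free(a);
    MPI_Finalize();
    return 0;
}

The trade-off the paper targets is visible here: a small B starts the downstream ranks sooner but pays per-message overhead more often, while a large B amortizes that overhead but lengthens pipeline fill and drain. A runtime system in the spirit of the paper would time the first iterations of the loop, feed those measurements into an execution model of per-block compute and communication cost, and then pick B (possibly a different B per block) instead of fixing it at compile time; the exact model and measurement mechanism are described in the paper rather than in this sketch.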