Exploiting Wavefront Parallelism on Large-Scale Shared-Memory Multiprocessors

Authors:
Naraig Manjikian; Tarek S. Abdelrahman
Venue:
IEEE Transactions on Parallel and Distributed Systems
Year:
2001



Abstract

Wavefront parallelism, in which parallelism is limited to hyperplanes in an iteration space, can arise when compilers apply tiling to loop nests to enhance locality. Previous approaches for scheduling wavefront parallelism focused on maximizing parallelism, balancing workloads, and reducing synchronization. In this paper, we show that on large-scale shared-memory multiprocessors, locality is a crucial factor. We make the distinction between intratile and intertile locality and show that as the number of processors grows, intertile locality becomes more important. We consider and experimentally evaluate existing strategies for scheduling wavefront parallelism. We show that dynamic self-scheduling can be efficiently used on a small number of processors, but performs poorly at large scale because it does not enhance intertile locality. By contrast, static scheduling strategies enhance intertile locality for small tiles, maintaining parallelism and resulting in better performance at large scale. Results from a Convex SPP1000 multiprocessor demonstrate the importance of taking intertile locality into account. Static scheduling outperforms dynamic self-scheduling by a factor of up to 2.3 on 30 processors.
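The distinction the abstract draws between static scheduling and dynamic self-scheduling can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not describe the exact strategies evaluated, so the row-cyclic static mapping, the round-robin stand-in for self-scheduling, and every function name below are assumptions made for illustration. The sketch groups the tiles of a two-dimensional tile grid into wavefronts (hyperplanes where `i + j` is constant) and measures, as a rough proxy for intertile locality, how often a tile runs on the same processor as its left neighbor in the same tile row.

```python
def wavefronts(rows, cols):
    """Group tiles (i, j) of a rows x cols tile grid by hyperplane i + j.

    Tiles on one hyperplane have no dependences among themselves, so they
    can run in parallel; successive hyperplanes must run in order.
    """
    return [
        [(i, d - i) for i in range(max(0, d - cols + 1), min(rows, d + 1))]
        for d in range(rows + cols - 1)
    ]


def static_schedule(rows, cols, nprocs):
    """Assumed static mapping: tile row i always runs on processor i % nprocs."""
    return {
        (i, j): i % nprocs
        for front in wavefronts(rows, cols)
        for (i, j) in front
    }


def dynamic_schedule(rows, cols, nprocs):
    """Assumed stand-in for self-scheduling: within each wavefront, tiles are
    handed out in order to processors 0, 1, 2, ... regardless of tile row.
    Real self-scheduling is nondeterministic; this is a deterministic model.
    """
    owner = {}
    for front in wavefronts(rows, cols):
        for k, tile in enumerate(front):
            owner[tile] = k % nprocs
    return owner


def reuse_fraction(owner, rows, cols):
    """Fraction of tiles that run on the same processor as their left
    neighbor (i, j - 1), used here as a crude proxy for intertile locality."""
    pairs = [((i, j - 1), (i, j)) for i in range(rows) for j in range(1, cols)]
    same = sum(owner[a] == owner[b] for a, b in pairs)
    return same / len(pairs)


if __name__ == "__main__":
    rows, cols, nprocs = 4, 4, 4
    for name, schedule in (("static", static_schedule),
                           ("dynamic", dynamic_schedule)):
        owner = schedule(rows, cols, nprocs)
        print(name, reuse_fraction(owner, rows, cols))
```

Under these assumptions the static mapping keeps every tile row on one processor, so data shared between horizontally adjacent tiles can stay in that processor's cache, whereas the order-based assignment moves rows between processors as the wavefront shape changes. The sketch models only which processor owns each tile; it does not model cache behavior, synchronization cost, or load balance.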