Guided self-scheduling: A practical scheduling scheme for parallel supercomputers
IEEE Transactions on Computers
POPL '88 Proceedings of the 15th ACM SIGPLAN-SIGACT symposium on Principles of programming languages
A data locality optimizing algorithm
PLDI '91 Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation
Ultracomputers: a teraflop before its time
Communications of the ACM
Factoring: a method for scheduling parallel loops
Communications of the ACM
Improving locality and parallelism in nested loops
Improving locality and parallelism in nested loops
The Stanford FLASH multiprocessor
ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Compiler transformations for high-performance computing
ACM Computing Surveys (CSUR)
Compiler cache optimizations for banded matrix problems
ICS '95 Proceedings of the 9th international conference on Supercomputing
On Effective Execution of Nonuniform DOACROSS Loops
IEEE Transactions on Parallel and Distributed Systems
A quantitative analysis of loop nest locality
Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
Fusion of Loops for Parallelism and Locality
IEEE Transactions on Parallel and Distributed Systems
New tiling techniques to improve cache temporal locality
Proceedings of the ACM SIGPLAN 1999 conference on Programming language design and implementation
An efficient algorithm for the run-time parallelization of DOACROSS loops
Proceedings of the 1994 ACM/IEEE conference on Supercomputing
Parallel Programming with Polaris
Computer
Using Processor Affinity in Loop Scheduling on Shared-Memory Multiprocessors
IEEE Transactions on Parallel and Distributed Systems
IEEE Transactions on Parallel and Distributed Systems
LCPC '96 Proceedings of the 9th International Workshop on Languages and Compilers for Parallel Computing
Integrating Scalar Optimization and Parallelization
Proceedings of the Fourth International Workshop on Languages and Compilers for Parallel Computing
ICPP '00 Proceedings of the Proceedings of the 2000 International Conference on Parallel Processing
(R) On Optimal Size and Shape of Supernode Transformations
ICPP '96 Proceedings of the Proceedings of the 1996 International Conference on Parallel Processing - Volume 3
Software methods for improvement of cache performance on supercomputer applications
Software methods for improvement of cache performance on supercomputer applications
Hyperplane Grouping and Pipelined Schedules: How to Execute Tiled Loops Fast on Clusters of SMPs
The Journal of Supercomputing
Optimization of FDTD computations in a streaming model architecture
PPAM'09 Proceedings of the 8th international conference on Parallel processing and applied mathematics: Part I
Dynamic multi phase scheduling for heterogeneous cluste
IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
Streaming model computation of the FDTD problem
PARA'10 Proceedings of the 10th international conference on Applied Parallel and Scientific Computing - Volume Part I
Hi-index | 0.00 |
Wavefront parallelism, in which parallelism is limited to hyperplanes in an iteration space, can arise when compilers apply tiling to loop nests to enhance locality. Previous approaches for scheduling wavefront parallelism focused on maximizing parallelism, balancing workloads, and reducing synchronization. In this paper, we show that on large-scale shared-memory multiprocessors, locality is a crucial factor. We make the distinction between intratile and intertile locality and show that as the number of processors grows, intertile locality becomes more important. We consider and experimentally evaluate existing strategies for scheduling wavefront parallelism. We show that dynamic self-scheduling can be efficiently used on a small number of processors, but performs poorly at large scale because it does not enhance intertile locality. By contrast, static scheduling strategies enhance intertile locality for small tiles, maintaining parallelism and resulting in better performance at large scale. Results from a Convex SPP1000 multiprocessor demonstrate the importance of taking intertile locality into account. Static scheduling outperforms dynamic self-scheduling by a factor of up to 2.3 on 30 processors.