Because a loop's body often executes many times, loops provide a rich opportunity for exploiting parallelism; indeed, loops can supply a large portion of the parallelism available in an application program. To exploit that parallelism, however, one must look beyond a single basic block or a single iteration for independent operations. Since parallel architectures differ in synchronization overhead, instruction scheduling constraints, memory latencies, and implementation details, determining the best approach for exploiting parallelism can be difficult: the choice of technique depends on the underlying architecture of the parallel machine and on the characteristics of each individual loop. To indicate their performance potential, this article surveys several architectures and compilation techniques using a common notation and consistent terminology. First we develop the critical dependence ratio, which determines a loop's maximum possible parallelism given infinite hardware; then we examine specific architectures and scheduling techniques and compare their results.
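To make the idea behind a dependence-based parallelism bound concrete, the sketch below computes a cycle-based bound over a toy dependence graph. It assumes each dependence edge carries a latency in cycles and an iteration distance (how many iterations the dependence spans); for every dependence cycle, total latency divided by total distance limits how often a new iteration can be initiated, and the maximum such ratio over all cycles gives the bound. The graph encoding, the example loop, and the function name are illustrative assumptions, not the article's actual formulation of the critical dependence ratio.

```python
# Sketch: a cycle-ratio bound on loop parallelism (illustrative only).
# Assumes a dependence graph with edges (src, dst, latency, distance):
# 'latency' is the operation delay in cycles, 'distance' is the number
# of iterations the dependence spans (0 = within one iteration).

from itertools import permutations

def cycle_ratio_bound(nodes, edges):
    """Max over all dependence cycles of total latency / total distance.

    New iterations can be initiated at best once every `bound` cycles,
    so this ratio limits the parallelism achievable even with infinite
    hardware. Brute-force enumeration: suitable for tiny graphs only.
    """
    adj = {(u, v): (lat, dist) for u, v, lat, dist in edges}
    best = 0.0
    # Enumerate candidate cycles as orderings of node subsets
    # (k = 1 covers self-dependences).
    for k in range(1, len(nodes) + 1):
        for perm in permutations(nodes, k):
            path = list(perm) + [perm[0]]        # close the cycle
            pairs = list(zip(path, path[1:]))
            if all(p in adj for p in pairs):
                lat = sum(adj[p][0] for p in pairs)
                dist = sum(adj[p][1] for p in pairs)
                if dist > 0:                     # loop-carried cycles only
                    best = max(best, lat / dist)
    return best

# Example loop:  for i: a[i] = a[i-2] * b[i]
# The multiply (assume 4 cycles) feeds itself 2 iterations later, so
# iterations can start at best every 4/2 = 2 cycles.
nodes = ["mul"]
edges = [("mul", "mul", 4, 2)]
print(cycle_ratio_bound(nodes, edges))  # -> 2.0
```

In practice a compiler would compute the maximum cycle ratio with a polynomial-time algorithm rather than enumerating cycles; brute force is used here only to keep the sketch short and self-contained.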