Because a loop's body often executes many times, loops provide a rich opportunity for exploiting parallelism; indeed, loops can supply a large portion of the parallelism available in an application program. To exploit that parallelism, however, one must look beyond a single basic block or a single iteration for independent operations. Since parallel architectures differ in synchronization overhead, instruction scheduling constraints, memory latencies, and implementation details, determining the best approach for exploiting parallelism can be difficult: the choice of technique depends on the underlying architecture of the parallel machine and on the characteristics of each individual loop. To indicate their performance potential, this article surveys several architectures and compilation techniques using a common notation and consistent terminology. First we develop the critical dependence ratio, which determines a loop's maximum possible parallelism given infinite hardware; then we examine specific architectures and scheduling techniques and compare their results.
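To make the idea behind a dependence-based parallelism bound concrete, the sketch below computes a cycle-based bound over a toy dependence graph. It assumes each dependence edge carries a latency in cycles and an iteration distance (how many iterations the dependence spans); for every dependence cycle, total latency divided by total distance limits how often a new iteration can be initiated, and the maximum such ratio over all cycles gives the bound. The graph encoding, the example loop, and the function name are illustrative assumptions, not the article's actual formulation of the critical dependence ratio.

```python
# Sketch: a cycle-ratio bound on loop parallelism (illustrative only).
# Assumes a dependence graph with edges (src, dst, latency, distance):
# 'latency' is the operation delay in cycles, 'distance' is the number
# of iterations the dependence spans (0 = within one iteration).

from itertools import permutations

def cycle_ratio_bound(nodes, edges):
    """Max over all dependence cycles of total latency / total distance.

    New iterations can be initiated at best once every `bound` cycles,
    so this ratio limits the parallelism achievable even with infinite
    hardware. Brute-force enumeration: suitable for tiny graphs only.
    """
    adj = {(u, v): (lat, dist) for u, v, lat, dist in edges}
    best = 0.0
    # Enumerate candidate cycles as orderings of node subsets
    # (k = 1 covers self-dependences).
    for k in range(1, len(nodes) + 1):
        for perm in permutations(nodes, k):
            path = list(perm) + [perm[0]]        # close the cycle
            pairs = list(zip(path, path[1:]))
            if all(p in adj for p in pairs):
                lat = sum(adj[p][0] for p in pairs)
                dist = sum(adj[p][1] for p in pairs)
                if dist > 0:                     # loop-carried cycles only
                    best = max(best, lat / dist)
    return best

# Example loop:  for i: a[i] = a[i-2] * b[i]
# The multiply (assume 4 cycles) feeds itself 2 iterations later, so
# iterations can start at best every 4/2 = 2 cycles.
nodes = ["mul"]
edges = [("mul", "mul", 4, 2)]
print(cycle_ratio_bound(nodes, edges))  # -> 2.0
```

In practice a compiler would compute the maximum cycle ratio with a polynomial-time algorithm rather than enumerating cycles; brute force is used here only to keep the sketch short and self-contained.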