Utilizing Multidimensional Loop Parallelism on Large Scale Parallel Processor Systems

Authors:
C. D. Polychronopoulos; David J. Kuck; David A. Padua
Affiliations:
Univ. of Illinois at Urbana-Champaign, Urbana; Univ. of Illinois at Urbana-Champaign, Urbana; Univ. of Illinois at Urbana-Champaign, Urbana
Venue:
IEEE Transactions on Computers
Year:
1989

Citing (9):

Processor Allocation for Horizontal and Vertical Parallelism and Related Speedup Bounds
    IEEE Transactions on Computers

Guided self-scheduling: A practical scheduling scheme for parallel supercomputers
    IEEE Transactions on Computers

Programs for Digital Signal Processing

Dependence graphs and compiler optimizations
    POPL '81 Proceedings of the 8th ACM SIGPLAN-SIGACT symposium on Principles of programming languages

Structure of Computers and Computations

Multiprocessors: discussion of some theoretical and practical problems

Compile-time scheduling and optimization for asynchronous machines (multiprocessor, compiler, parallel processing)

Compiler optimizations and architecture design issues for multiprocessors (parallel)

On program restructuring, scheduling, and communication for parallel processor systems

Cited (15):

On the combination of hardware and software concurrency extraction methods
    ACM SIGMICRO Newsletter

Switch-stacks: a scheme for microtasking nested parallel loops
    Proceedings of the 1990 ACM/IEEE conference on Supercomputing

Processor allocation and loop scheduling on multiprocessor computers
    ICS '92 Proceedings of the 6th international conference on Supercomputing

Combining static and dynamic scheduling on distributed-memory multiprocessors
    ICS '94 Proceedings of the 8th international conference on Supercomputing

Valid Transformations: A New Class of Loop Transformations for High-Level Synthesis and Pipelined Scheduling Applications
    IEEE Transactions on Parallel and Distributed Systems

On the combination of hardware and software concurrency extraction methods
    MICRO 20 Proceedings of the 20th annual workshop on Microprogramming

Pipelined Data Parallel Algorithms-II: Design
    IEEE Transactions on Parallel and Distributed Systems

Synthesizing Nested Loop Algorithms Using Nonlinear Transformation Method
    IEEE Transactions on Parallel and Distributed Systems

Efficient Processor Assignment Algorithms and Loop Transformations for Executing Nested Parallel Loops on Multiprocessors
    IEEE Transactions on Parallel and Distributed Systems

Synchronization and Communication Costs of Loop Partitioning on Shared-Memory Multiprocessor Systems
    IEEE Transactions on Parallel and Distributed Systems

Optimal Processor Assignment for a Class of Pipelined Computations
    IEEE Transactions on Parallel and Distributed Systems

Hierarchical Compilation of Macro Dataflow Graphs for Multiprocessors with Local Memory
    IEEE Transactions on Parallel and Distributed Systems

New Software Technologies for the Development and Runtime Support of Complex Applications
    International Journal of High Performance Computing Applications

FleXilicon architecture and its VLSI implementation
    IEEE Transactions on Very Large Scale Integration (VLSI) Systems

Enhanced loop coalescing: a compiler technique for transforming non-uniform iteration spaces
    ISHPC'05/ALPS'06 Proceedings of the 6th international symposium on high-performance computing and 1st international conference on Advanced low power systems

Abstract

Program parallelism and processor allocation issues for parallel processor systems are discussed. Optimal processor assignment algorithms are presented for simple and complex nested parallel loops. These processor assignment schemes can be used by the compiler to perform static processor allocation to multiply nested parallel loops. Speedup measurements for EISPACK and IEEE DSP subroutines that result from the optimal assignment of processors to parallel loops are also presented. These measurements indicate that optimal processor assignments result in almost linear speedups on parallel processor machines with a few tens of processors and significantly high speedups for machines with hundreds or thousands of processors.
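
To make the idea of static processor allocation to multiply nested parallel loops concrete, the following is a minimal sketch and not the algorithm from the paper. It assumes a perfect nest of fully parallel loops with unit-cost iterations, assumes that loop i with N_i iterations and p_i processors takes ceil(N_i / p_i) steps, and assumes that the allocation must satisfy p_1 * p_2 * ... <= P. The function names `parallel_time` and `best_allocation` are hypothetical, and the search is plain brute force.

```python
from itertools import product
from math import ceil, prod


def parallel_time(bounds, alloc):
    """Steps needed when loop i (bounds[i] iterations) gets alloc[i] processors.

    Assumed cost model: each nesting level contributes ceil(N_i / p_i) steps,
    and the levels multiply.
    """
    return prod(ceil(n / p) for n, p in zip(bounds, alloc))


def best_allocation(bounds, total_procs):
    """Exhaustively search allocations whose product fits in total_procs.

    Returns (allocation, steps). Ties on steps are broken in favour of the
    allocation that uses fewer processors.
    """
    best_key, best_alloc = None, None
    # Giving a loop more processors than it has iterations never helps.
    ranges = [range(1, min(n, total_procs) + 1) for n in bounds]
    for alloc in product(*ranges):
        used = prod(alloc)
        if used > total_procs:
            continue  # infeasible: needs more processors than available
        key = (parallel_time(bounds, alloc), used)
        if best_key is None or key < best_key:
            best_key, best_alloc = key, alloc
    return best_alloc, best_key[0]


if __name__ == "__main__":
    bounds = (10, 6)  # hypothetical loop bounds of a doubly nested parallel loop
    procs = 16        # hypothetical machine size
    print(best_allocation(bounds, procs))
    # Baseline: parallelize only the outer loop (at most 10 useful processors).
    print(parallel_time(bounds, (10, 1)))
```

Under these assumptions, a 10 x 6 nest on 16 processors is best served by the allocation (5, 3), which finishes in 4 steps, whereas parallelizing only the outer loop takes 6 steps. The exhaustive search is only practical for shallow nests and modest processor counts; it is meant to illustrate the objective being optimized, not an efficient assignment procedure.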