The main aim of this paper is to study the allocation of processors to parallel programs executing on a multiprocessor system, and the resulting speedups. First, we consider a parallel program as a sequence of steps, where each step consists of a set of parallel operations. General bounds on the speedup on a p-processor system are derived from this model. Measurements of code parallelism for the LINPACK numerical package are presented to support the belief that typical numerical programs contain much potential parallelism that a good restructuring compiler can discover. Next, a parallel program is represented as a task graph whose nodes are doacross loops, i.e., loops whose iterations can be partially overlapped. It is shown how processors can be allocated to exploit both horizontal and vertical parallelism in such graphs. Two heuristic processor-allocation algorithms (WP and PA) are presented. PA is the heart of WP and is used to obtain efficient processor allocations for a set of independent parallel tasks; WP allocates processors to general task graphs. Finally, a general formula is given for the speedup of a doacross loop that is more accurate than the known formula.
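The two quantitative ideas in the abstract can be illustrated concretely. Under the step model, if step i contains b_i independent operations, a natural bound on the p-processor speedup is S_p = (sum over i of b_i) / (sum over i of ceil(b_i / p)); this is a sketch of the kind of bound such a model supports, not necessarily the paper's exact result. Likewise, the sketch below (in Python; the parameters N, T, d and the cyclic iteration-to-processor assignment are illustrative assumptions, not the paper's WP/PA algorithms or its speedup formula) simulates a doacross loop whose N iterations each take time T and must start at least d apart, and reports the resulting speedup on p processors.

    # Hypothetical sketch: doacross loop timing under cyclic scheduling.
    # N iterations, body time T, delay d between starts of consecutive
    # iterations, p processors. Not the paper's WP/PA algorithms.

    def doacross_time(N, T, d, p):
        """Finish time of a doacross loop with cyclic iteration assignment."""
        free = [0.0] * p          # time at which each processor becomes idle
        prev_start = -d           # lets iteration 0 start at time 0
        for i in range(N):
            proc = i % p          # cyclic assignment of iteration i
            start = max(prev_start + d,   # dependence delay from iteration i-1
                        free[proc])       # processor availability
            free[proc] = start + T
            prev_start = start
        return max(free)

    def speedup(N, T, d, p):
        serial = N * T            # one processor, no iteration overlap
        return serial / doacross_time(N, T, d, p)

    if __name__ == "__main__":
        N, T, d = 100, 10.0, 2.0
        for p in (1, 2, 4, 8, 16):
            print(f"p={p:2d}  speedup={speedup(N, T, d, p):5.2f}")

In this simulation the speedup saturates near T/d as p grows (about 4.8 for the parameters above), reflecting that the inter-iteration delay, rather than the processor count, eventually limits a doacross loop.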