Advanced compiler optimizations for supercomputers
Communications of the ACM - Special issue on parallelism
Incremental performance contributions of hardware concurrency extraction techniques
Proceedings of the 1st International Conference on Supercomputing
A compilation technique for software pipelining of loops with conditional jumps
MICRO 20 Proceedings of the 20th annual workshop on Microprogramming
GURPR—a method for global software pipelining
MICRO 20 Proceedings of the 20th annual workshop on Microprogramming
On the combination of hardware and software concurrency extraction methods
MICRO 20 Proceedings of the 20th annual workshop on Microprogramming
Perfect Pipelining: A New Loop Parallelization Technique
ESOP '88 Proceedings of the 2nd European Symposium on Programming
Hardware extraction of low-level concurrency from sequential instruction streams (parallelism, implementation, architecture, dependencies, semantics)
On program restructuring, scheduling, and communication for parallel processor systems
On program restructuring, scheduling, and communication for parallel processor systems
On optimal loop parallelization
MICRO 22 Proceedings of the 22nd annual workshop on Microprogramming and microarchitecture
A preceding activation scheme with graph unfolding for the parallel processing system-array
Proceedings of the 1989 ACM/IEEE conference on Supercomputing
Requirements for Optimal Execution of Loops with Tests
IEEE Transactions on Parallel and Distributed Systems
Hi-index | 0.01 |
Both the efficient execution of branch intensive code and knowing the bounds on same are important issues in computing in general and supercomputing in particular. In prior work, it has been suggested, implied, or left as a possible maximum, that the hardware needed to execute code with branches optimally, i.e., oracular performance, is exponentially dependent on the total number of dynamic branches to be executed, this number of branches being proportional at least to the number of iterations of the loop. For classes of code taking at least one cycle per iteration to execute, this is not the case. For loops containing one test (normally in the form of a Boolean recurrence of order 1), it is shown that the hardware necessary varies from exponential to polynomial in the length of the dependency cycle L, while execution time varies from one time cycle per iteration to less than L time cycles per iteration; the variation depends on specific code dependencies.