Quantitative system performance: computer system analysis using queueing network models
Numerical recipes: the art of scientific computing
Parallel and distributed computation: numerical methods
Survey of closed queueing networks with blocking
ACM Computing Surveys (CSUR)
Run-Time Parallelization and Scheduling of Loops
IEEE Transactions on Computers
ICS '94 Proceedings of the 8th international conference on Supercomputing
PLDI '95 Proceedings of the ACM SIGPLAN 1995 conference on Programming language design and implementation
Schedulers as abstract interpretations of higher-dimensional automata
PEPM '95 Proceedings of the 1995 ACM SIGPLAN symposium on Partial evaluation and semantics-based program manipulation
Run-time methods for parallelizing partially parallel loops
ICS '95 Proceedings of the 9th international conference on Supercomputing
Performance debugging shared memory parallel programs using run-time dependence analysis
SIGMETRICS '97 Proceedings of the 1997 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Dynamic feedback: an effective technique for adaptive computing
Proceedings of the ACM SIGPLAN 1997 conference on Programming language design and implementation
IEEE Transactions on Parallel and Distributed Systems
Eliminating synchronization overhead in automatically parallelized programs using dynamic feedback
ACM Transactions on Computer Systems (TOCS)
Architectural support for scalable speculative parallelization in shared-memory multiprocessors
Proceedings of the 27th annual international symposium on Computer architecture
Time Stamp Algorithms for Runtime Parallelization of DOACROSS Loops with Dynamic Dependences
IEEE Transactions on Parallel and Distributed Systems
Techniques for speculative run-time parallelization of loops
SC '98 Proceedings of the 1998 ACM/IEEE conference on Supercomputing
An efficient algorithm for the run-time parallelization of DOACROSS loops
Proceedings of the 1994 ACM/IEEE conference on Supercomputing
Time-Stamping Algorithms for Parallelization of Loops at Run-Time
IPPS '97 Proceedings of the 11th International Symposium on Parallel Processing
The R-LRPD Test: Speculative Parallelization of Partially Parallel Loops
IPDPS '02 Proceedings of the 16th International Parallel and Distributed Processing Symposium
Principles of Speculative Run-Time Parallelization
LCPC '98 Proceedings of the 11th International Workshop on Languages and Compilers for Parallel Computing
Techniques for Reducing the Overhead of Run-Time Parallelization
CC '00 Proceedings of the 9th International Conference on Compiler Construction
An Efficient Run-Time Scheme for Exploiting Parallelism on Multiprocessor Systems
HiPC '00 Proceedings of the 7th International Conference on High Performance Computing
Toward efficient and robust software speculative parallelization on multiprocessors
Proceedings of the ninth ACM SIGPLAN symposium on Principles and practice of parallel programming
The Illinois Aggressive Coma Multiprocessor project (I-ACOMA)
FRONTIERS '96 Proceedings of the 6th Symposium on the Frontiers of Massively Parallel Computation
Design Space Exploration of a Software Speculative Parallelization Scheme
IEEE Transactions on Parallel and Distributed Systems
Sensitivity analysis for automatic parallelization on multi-cores
Proceedings of the 21st annual international conference on Supercomputing
The potential of trace-level parallelism in Java programs
Proceedings of the 5th international symposium on Principles and practice of programming in Java
Implementation of Sensitivity Analysis for Automatic Parallelization
Languages and Compilers for Parallel Computing
Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming
A study of potential parallelism among traces in Java programs
Science of Computer Programming
Predecessor/successor approach for high-performance run-time wavefront scheduling
Information Sciences: an International Journal
Probabilistic program analysis for parallelizing compilers
VECPAR'04 Proceedings of the 6th international conference on High Performance Computing for Computational Science
Code generation for parallel execution of a class of irregular loops on distributed memory systems
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Automatic communication coalescing for irregular computations in UPC language
CASCON '12 Proceedings of the 2012 Conference of the Center for Advanced Studies on Collaborative Research
When the inter-iteration dependency pattern of a loop's iterations cannot be determined statically, compile-time parallelization of the loop is not possible. In these cases, runtime parallelization [8] is the only alternative. The idea is to transform the loop into two code fragments: the inspector and the executor. When the program is run, the inspector examines the iteration dependencies and constructs a parallel schedule. The executor subsequently uses that schedule to carry out the actual computation in parallel.

In this paper, we show how to reduce the overhead of running the inspector through its parallel execution. We describe two related approaches. The first, which emphasizes inspector efficiency, achieves nearly linear speedup relative to a sequential execution of the inspector, but produces a schedule that may be less efficient for the executor. The second technique, which emphasizes executor efficiency, does not in general achieve linear speedup of the inspector, but is guaranteed to produce the best achievable schedule. We present these techniques, show that they are correct, and compare their performance to existing techniques in a set of experiments.

Because this paper optimizes inspector time while leaving the executor unchanged, the techniques we present have the most dramatic effect when the inspector must be run for each invocation of the source loop. In a companion paper [3], we explore techniques that build upon those developed here to also improve executor performance.
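To make the inspector/executor transformation concrete, the sketch below shows one common wavefront-style scheme in C for a loop whose cross-iteration dependencies flow through index arrays r[] and w[] that are unknown until run time. The function names, the per-element conflict tracking, and the OpenMP executor loop are illustrative assumptions for exposition only; they are not the specific algorithms developed in this paper, which in particular also parallelize the inspector itself.

/* Source loop with statically unknown dependencies:
 *     for (i = 0; i < n; i++)  a[w[i]] = f(a[r[i]]);
 *
 * Inspector: assign each iteration to a wavefront, one stage later than the
 * deepest earlier iteration that touches the same array element.  Iterations
 * placed in the same wavefront are then mutually independent.
 */
#include <stdlib.h>

static int inspector(int n, const int *r, const int *w, int m, int *wave)
{
    /* last[e]: deepest wavefront so far that touched a[e] (0 = none); m = size of a[] */
    int *last = calloc(m, sizeof *last);
    int nwaves = 0;
    for (int i = 0; i < n; i++) {
        int prev = last[r[i]] > last[w[i]] ? last[r[i]] : last[w[i]];
        wave[i] = prev + 1;
        last[r[i]] = last[w[i]] = wave[i];   /* later accesses to these elements must wait for iteration i */
        if (wave[i] > nwaves) nwaves = wave[i];
    }
    free(last);
    return nwaves;
}

/* Executor: run the iterations of each wavefront as a parallel loop, with an
 * implicit barrier (the end of the parallel region) between wavefronts. */
static void executor(int n, const int *r, const int *w, double *a,
                     const int *wave, int nwaves)
{
    for (int s = 1; s <= nwaves; s++) {
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            if (wave[i] == s)
                a[w[i]] = 2.0 * a[r[i]] + 1.0;   /* placeholder for the original loop body */
    }
}

A schedule built this way must be recomputed whenever the index arrays change, so when the inspector runs on every invocation of the source loop its (here sequential) cost dominates; that is the overhead the parallel-inspector techniques in this paper are meant to reduce.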