By examining the structure and characteristics of parallel programs, the author isolates potential sources of overhead. The first compiler optimization considered is cycle shrinking, which can be used to parallelize certain types of serial loops. Run-time dependence analysis is then considered, along with how it can be performed through compiler-inserted bookkeeping and control statements. Loops with unstructured parallelism, which cannot benefit from existing optimizations, can be parallelized through run-time dependence checking. Finally, barrier synchronization is discussed as one of the most serious sources of run-time overhead in parallel programs. To reduce the impact of barriers, the author briefly discusses implementing distributed barriers with a set of shared registers.
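
As a rough illustration of the idea behind cycle shrinking, the sketch below takes a serial loop with a single flow dependence of constant distance `k` and runs it as a serial sequence of groups, each holding up to `k` mutually independent iterations. It is a minimal sketch under stated assumptions, not the author's formulation: the loop body `a[i] = a[i - k] + b[i]`, the function names, and the use of a thread pool are all hypothetical choices made for the example, and `k` is assumed to be at least 1.

```python
from concurrent.futures import ThreadPoolExecutor


def serial_loop(a, b, k):
    """Reference version: iteration i depends on iteration i - k."""
    a = list(a)
    for i in range(k, len(a)):
        a[i] = a[i - k] + b[i]
    return a


def cycle_shrunk_loop(a, b, k, workers=4):
    """Same computation, run as serial groups of up to k parallel iterations.

    Assumes k >= 1. Hypothetical illustration only.
    """
    a = list(a)
    n = len(a)

    def body(i):
        # Reads a value written by an earlier group (or the initial data).
        a[i] = a[i - k] + b[i]

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # The outer loop stays serial: one step per group of k iterations.
        for start in range(k, n, k):
            group = range(start, min(start + k, n))
            # Iterations in a group write distinct elements and read only
            # elements below `start`, so they can run concurrently.
            # Consuming the map result waits for the whole group to finish
            # before the next group starts.
            list(pool.map(body, group))
    return a
```

With a dependence distance of `k`, the serial part of the loop shrinks from roughly `n` steps to roughly `n / k` steps, at the cost of one synchronization point per group. When `k` is 1 the groups hold a single iteration each and the loop stays effectively serial.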