The PARADIGM (PARAllelizing compiler for DIstributed-memory General-purpose Multicomputers) project at the University of Illinois provides a fully automated means to parallelize programs, written in a serial programming model, for execution on distributed-memory multicomputers. To provide efficient execution, PARADIGM automatically performs various optimizations to reduce the overhead and idle time caused by interprocessor communication. The optimizations studied in this paper include message coalescing, message vectorization, message aggregation, and coarse-grain pipelining. To separate the optimization algorithms from machine-specific details, parameterized models are used to estimate communication and computation costs for a given machine. The models are also used in coarse-grain pipelining to automatically select a task granularity that balances the available parallelism with the costs of communication. To determine the applicability of the optimizations on different machines, we analyzed their performance on an Intel iPSC/860, an Intel iPSC/2, and a Thinking Machines CM-5.
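As a deliberately simplified illustration of two of the optimizations named above, the C sketch below shows message vectorization and a cost-model-driven choice of pipeline granularity. It is not PARADIGM's generated code: MPI stands in for the machines' native message layers (NX on the iPSC systems, CMMD on the CM-5), the linear cost t(n) = alpha + beta*n is only the standard startup-plus-bandwidth form that parameterized models of this kind typically take, and all names and constants here (send_naive, send_vectorized, pick_block_size, N) are hypothetical.

    /* Hypothetical sketch, not PARADIGM output; MPI stands in for NX/CMMD. */
    #include <mpi.h>

    #define N 1024  /* placeholder array length */

    /* Unoptimized: one message per iteration, so the per-message
     * startup cost alpha is paid N times. */
    void send_naive(double *a, int dest) {
        for (int i = 0; i < N; i++)
            MPI_Send(&a[i], 1, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
    }

    /* Message vectorization: the communication is hoisted out of the
     * loop and all N elements travel in one message, paying alpha once. */
    void send_vectorized(double *a, int dest) {
        MPI_Send(a, N, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
    }

    /* Coarse-grain pipelining granularity: under the linear model,
     * pushing n_iters iterations through a p-stage pipeline in blocks
     * of b iterations takes roughly (p + n_iters/b - 1) stages, each
     * costing b*t_comp for computation plus alpha + beta*b*bytes for
     * communication. Small b exposes parallelism sooner; large b
     * amortizes startup. This brute-force search only makes the
     * trade-off concrete; the compiler selects the block analytically. */
    int pick_block_size(int n_iters, int p, double t_comp,
                        double alpha, double beta, int bytes_per_iter) {
        int best_b = 1;
        double best_t = 1e300;
        for (int b = 1; b <= n_iters; b++) {
            double stages  = p + (double)n_iters / b - 1.0;
            double stage_t = b * t_comp
                           + alpha + beta * (double)b * bytes_per_iter;
            double t = stages * stage_t;
            if (t < best_t) { best_t = t; best_b = b; }
        }
        return best_b;
    }

Message coalescing and aggregation reduce startup overhead in the same spirit as the vectorized send above, by merging redundant or independent messages bound for the same processor into a single transfer.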