Data-parallel compilers have long aimed to equal the performance of carefully hand-optimized parallel codes. For tightly coupled applications based on line sweeps, this goal has been particularly elusive. In the Rice dHPF compiler, we have developed a wide spectrum of optimizations that enable us to closely approach hand-coded performance for tightly coupled line-sweep applications, including the NAS SP and BT benchmark codes. From lightly modified copies of standard serial versions of these benchmarks, dHPF generates MPI-based parallel code that is within 4% of the performance of the hand-crafted MPI implementations of these codes for a 102³ problem size (Class B) on 64 processors. We describe and quantitatively evaluate the impact of the partitioning, communication, and memory hierarchy optimizations implemented by dHPF that enable us to approach hand-coded performance with compiler-generated parallel code.
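To illustrate why line-sweep applications are tightly coupled, consider the recurrence at their core: each point depends on its predecessor along the sweep direction, so that dimension cannot be partitioned without serializing or pipelining communication. The sketch below (an illustrative example only, not dHPF-generated code or the actual NAS SP/BT kernels) shows such a forward sweep over a 2D grid:

```python
def line_sweep_x(a, c):
    """Forward sweep along the first dimension of a 2D grid.

    Each a[i][j] is updated using a[i-1][j], so the computation along i is
    a serialized recurrence -- the defining property of line-sweep solvers
    such as the ADI-style kernels in NAS SP and BT. A block partition of
    the i dimension would force each processor to wait for its neighbor's
    boundary row before starting, which is why compilers resort to
    pipelining or multipartitioning for such codes.
    """
    nx = len(a)
    ny = len(a[0])
    for i in range(1, nx):
        for j in range(ny):
            a[i][j] += c * a[i - 1][j]
    return a


# A sweep over a grid of ones with c = 1 accumulates the row index + 1:
grid = [[1.0] * 3 for _ in range(3)]
line_sweep_x(grid, 1.0)
# grid is now [[1, 1, 1], [2, 2, 2], [3, 3, 3]]
```

Sweeps along the other dimension (j) have the analogous dependence on a[i][j-1], so no single partitioning is communication-free for all sweep directions.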