Optimized handling of reductions on parallel supercomputers and clusters of workstations is critical to high performance because reductions are common in scientific codes and a potential source of bottlenecks. Yet many high-level languages still lack a mechanism for writing efficient reductions. Where such mechanisms do exist, they often lack the flexibility a programmer needs to reach the desired level of performance. In this paper, we present a new language construct for arbitrary reductions that lets a programmer match the performance achievable with the highly flexible but low-level combination of Fortran and MPI. We have implemented this construct in the ZPL language and evaluate it on the initialization of the NAS MG benchmark. We show a 45-fold speedup over the same code written in ZPL without this construct. In addition, performance on a large number of processors surpasses that of the NAS implementation, showing that our mechanism gives programmers the flexibility they need.
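To make the notion of an arbitrary (user-defined) reduction concrete, the following is a minimal sequential sketch in Python. It is not ZPL syntax and not the construct presented in the paper. It assumes the common formulation in which the programmer supplies an identity, a function that folds one element into a partial result, and a function that merges two partial results. The names `identity`, `accumulate`, `combine`, and `user_defined_reduce`, the constant `K`, and the choice of a "K largest values" reduction are all hypothetical and chosen only for illustration. The list of chunks stands in for per-processor data.

```python
from functools import reduce
import heapq

K = 3  # hypothetical: how many of the largest values to keep


def identity():
    """Partial result for an empty chunk."""
    return []


def accumulate(state, value):
    """Fold one element into a partial result, keeping the K largest."""
    return heapq.nlargest(K, state + [value])


def combine(left, right):
    """Merge two partial results, as would happen between processors."""
    return heapq.nlargest(K, left + right)


def user_defined_reduce(chunks):
    # Local phase: each chunk is reduced independently to a partial result.
    partials = [reduce(accumulate, chunk, identity()) for chunk in chunks]
    # Global phase: partial results are merged into the final answer.
    return reduce(combine, partials, identity())


data = [7, 2, 9, 4, 11, 5, 3, 8, 1, 10, 6, 0]
# Split the data into three chunks of four elements each.
chunks = [data[i:i + 4] for i in range(0, len(data), 4)]
result = user_defined_reduce(chunks)
print(result)
```

Because `combine` is associative, the partial results can be merged in any grouping. That property is what allows a parallel implementation to replace the sequential merge in the global phase with a tree-shaped one.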