As part of the recent focus on increasing the productivity of parallel application developers, Co-array Fortran (CAF) has emerged as an appealing alternative to the Message Passing Interface (MPI). CAF belongs to the family of global address space parallel programming languages; such languages provide the abstraction of globally addressable memory accessed using one-sided communication. At Rice University we are developing cafc, an open-source, multi-platform CAF compiler. Our earlier studies show that cafc-compiled CAF programs achieve performance similar to that of corresponding MPI codes for the NAS Parallel Benchmarks. In this paper, we present a study of several CAF implementations of Sweep3D on four modern architectures. We analyze the impact of using one-sided communication in Sweep3D, identify potential sources of inefficiency, and suggest ways to address them. Our results show that we achieve performance comparable to that of the MPI version on three cluster-based architectures and outperform it by up to 10% on the SGI Altix 3000.
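To make the one-sided, globally addressable memory model concrete, the sketch below shows the basic idiom: a co-array replicated on every image, a one-sided put into a neighbor image's copy, and explicit synchronization. This is a minimal, hypothetical example in Fortran 2008 coarray syntax (which closely mirrors CAF); it is not taken from the paper's Sweep3D code, and the names buf and right are illustrative only.

    program caf_put_sketch
      ! Hypothetical sketch of the one-sided CAF model: every image holds a
      ! copy of the co-array, and any image may write directly into another
      ! image's copy without that image's participation.
      implicit none
      integer, parameter :: n = 8
      real    :: buf(n)[*]                  ! co-array: remotely addressable on all images
      integer :: me, np, right

      me = this_image()
      np = num_images()
      right = merge(1, me + 1, me == np)    ! periodic right neighbor

      buf = real(me)
      sync all                              ! all images finish initializing

      buf(:)[right] = real(me)              ! one-sided put into the neighbor's memory
      sync all                              ! make the put visible before further use
    end program caf_put_sketch

With any coarray-capable Fortran compiler this compiles and runs as-is; the put requires no matching receive on the target image, which is the property the paper's communication analysis revolves around.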