In this work we make a strong case for remote memory access (RMA) as the effective way to program a parallel computer, proposing a framework that supports RMA in a library-independent, simple, and intuitive way. Parallel code written with our approach runs transparently under MPI-2 enabled libraries as well as bulk-synchronous parallel (BSP) libraries. The advantages of using RMA are simpler code, reduced programming complexity, and increased efficiency. We support these claims by implementing, under this framework, a collection of benchmark programs: a communication and synchronization performance assessment program, a dense matrix multiplication algorithm, and two variants of a parallel radix-sort algorithm. We examine their performance on a Linux-based PC cluster under three RMA-enabled libraries: LAM MPI, BSPlib, and PUB. We conclude that implementations of such parallel algorithms using RMA communication primitives lead to code that is as efficient as the equivalent message-passing code and, in the case of radix-sort, substantially more efficient. In addition, our work can serve as a comparative study of the relevant capabilities of the three libraries.