Currently, message passing (MP) and shared address space (SAS) are the two leading parallel programming paradigms. MP has been standardized with MPI and is the more common and mature approach; however, code development can be extremely difficult, especially for irregularly structured computations. SAS offers substantial ease of programming, but may suffer from performance limitations due to poor spatial locality and high protocol overhead. In this paper, we compare the performance of, and the programming effort required for, six applications under both programming models on a 32-processor PC-SMP cluster, a platform that is becoming increasingly attractive for high-end scientific computing. Our application suite consists of codes that typically do not exhibit scalable performance under shared-memory programming due to their high communication-to-computation ratios and/or complex communication patterns. Results indicate that SAS can achieve about half the parallel efficiency of MPI for most of our applications, while being competitive for the others. A hybrid MPI + SAS strategy shows only a small performance advantage over pure MPI in some cases. Finally, improved implementations of two MPI collective operations on PC-SMP clusters are presented.
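The paradigm distinction the abstract draws can be illustrated with a toy sketch (not from the paper, which evaluates MPI and SAS implementations on a PC-SMP cluster): under message passing, workers own disjoint data and exchange partial results explicitly, whereas under a shared address space they update common data and rely on synchronization instead of messages. Python threads, which share one address space, serve here purely as an illustration; the function names are hypothetical.

```python
import threading
import queue

def mp_style_sum(data, nworkers=4):
    """Message-passing style: each worker owns a disjoint chunk and
    communicates its partial sum explicitly through a queue."""
    q = queue.Queue()
    chunks = [data[i::nworkers] for i in range(nworkers)]

    def worker(chunk):
        q.put(sum(chunk))  # explicit "send" of the partial result

    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # explicit "receive" of one message per worker
    return sum(q.get() for _ in range(nworkers))

def sas_style_sum(data, nworkers=4):
    """Shared-address-space style: workers accumulate into one shared
    location; a lock replaces explicit communication."""
    total = [0]
    lock = threading.Lock()
    chunks = [data[i::nworkers] for i in range(nworkers)]

    def worker(chunk):
        partial = sum(chunk)
        with lock:  # synchronization instead of message exchange
            total[0] += partial

    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total[0]

if __name__ == "__main__":
    data = list(range(1000))
    print(mp_style_sum(data), sas_style_sum(data))
```

Both functions compute the same reduction; the difference lies entirely in how the partial results meet, which is the programmability/performance trade-off the paper quantifies at scale.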