A Comparison of MPI, SHMEM and Cache-Coherent Shared Address Space Programming Models on a Tightly-Coupled Multiprocessors

Authors:
Hongzhang Shan;Jaswinder Pal Singh
Affiliations:
Department of Computer Science, Princeton University. shz@cs.princeton.edu;Department of Computer Science, Princeton University. jps@cs.princeton.edu
Venue:
International Journal of Parallel Programming
Year:
2001

Abstract

We compare the performance of three major programming models on a modern, 64-processor hardware cache-coherent machine, one of the two major types of platforms upon which high-performance computing is converging. We focus on applications that are either regular, predictable or at least do not require fine-grained dynamic replication of irregularly accessed data. Within this class, we use programs with a range of important communication patterns. We examine whether the basic parallel algorithm and communication structuring approaches needed for best performance are similar or different among the models, whether some models have substantial performance advantages over others as problem size and number of processors change, what the sources of these performance differences are, where the programs spend their time, and whether substantial improvements can be obtained by modifying either the application programming interfaces or the implementations of the programming models on this type of tightly-coupled multiprocessor platform.