Application restructuring and performance portability on shared virtual memory and hardware-coherent multiprocessors

Authors:
Dongming Jiang;Hongzhang Shan;Jaswinder Pal Singh
Affiliations:
Department of Computer Science, 35 Olden Street, Princeton University, Princeton, NJ;Department of Computer Science, 35 Olden Street, Princeton University, Princeton, NJ;Department of Computer Science, 35 Olden Street, Princeton University, Princeton, NJ
Venue:
PPOPP '97 Proceedings of the sixth ACM SIGPLAN symposium on Principles and practice of parallel programming
Year:
1997

Citing 18
Cited 25

The DASH prototype: implementation and performance

ISCA '92 Proceedings of the 19th annual international symposium on Computer architecture
An empirical comparison of the Kendall Square Research KSR-1 and Stanford DASH multiprocessors

Proceedings of the 1993 ACM/IEEE conference on Supercomputing
Parallel Visualization Algorithms: Performance and Architectural Implications

Computer
The Stanford FLASH multiprocessor

ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Tempest and typhoon: user-level shared memory

ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
The performance impact of flexibility in the Stanford FLASH multiprocessor

ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
The MIT Alewife machine: architecture and performance

ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
The SPLASH-2 programs: characterization and methodological considerations

ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Understanding application performance on shared virtual memory systems

ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
Application and architectural bottlenecks in large scale distributed shared memory machines

ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
Shasta: a low overhead, software-only approach for supporting fine-grain shared memory

Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
Scope consistency: a bridge between release consistency and entry consistency

Proceedings of the eighth annual ACM symposium on Parallel algorithms and architectures
Performance evaluation of two home-based lazy release consistency protocols for shared virtual memory systems

OSDI '96 Proceedings of the second USENIX symposium on Operating systems design and implementation
Fast volume rendering using a shear-warp factorization of the viewing transformation

Fast volume rendering using a shear-warp factorization of the viewing transformation
Improving parallel shear-warp volume rendering on shared address space multiprocessors

PPOPP '97 Proceedings of the sixth ACM SIGPLAN symposium on Principles and practice of parallel programming
Application-specific protocols for user-level shared memory

Proceedings of the 1994 ACM/IEEE conference on Supercomputing
Myrinet: A Gigabit-per-Second Local Area Network

IEEE Micro
An Evaluation of a Commercial CC-NUMA Architecture: The CONVEX Exemplar SPP1200

IPPS '97 Proceedings of the 11th International Symposium on Parallel Processing

Monitoring shared virtual memory performance on a Myrinet-based PC cluster

ICS '98 Proceedings of the 12th international conference on Supercomputing
Evaluation of hardware write propagation support for next-generation shared virtual memory clusters

ICS '98 Proceedings of the 12th international conference on Supercomputing
A methodology and an evaluation of the SGI Origin2000

SIGMETRICS '98/PERFORMANCE '98 Proceedings of the 1998 ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems
Using network interface support to avoid asynchronous protocol processing in shared virtual memory systems

ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
Scaling application performance on a cache-coherent multiprocessor

ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
Application scaling under shared virtual memory on a cluster of SMPs

ICS '99 Proceedings of the 13th international conference on Supercomputing
Comparative study of page-based and segment-based software DSM through compiler optimization

Proceedings of the 14th international conference on Supercomputing
Program transformation and runtime support for threaded MPI execution on shared-memory machines

ACM Transactions on Programming Languages and Systems (TOPLAS)
Accelerating shared virtual memory via general-purpose network interface support

ACM Transactions on Computer Systems (TOCS)
Improving fine-grained irregular shared-memory benchmarks by data reordering

Proceedings of the 2000 ACM/IEEE conference on Supercomputing
The effects of communication parameters on end performance of shared virtual memory clusters

SC '97 Proceedings of the 1997 ACM/IEEE conference on Supercomputing
Multi-protocol active messages on a cluster of SMP's

SC '97 Proceedings of the 1997 ACM/IEEE conference on Supercomputing
Design issues for a high-performance distributed shared memory on symmetrical multiprocessor clusters

Cluster Computing
An Effective Logical Cache for a Clustered LRC-Based DSM System

Cluster Computing
Shared Virtual Memory Clusters with Next-Generation Interconnection Networks and Wide Compute Nodes

HiPC '01 Proceedings of the 8th International Conference on High Performance Computing
CableS: Thread Control and Memory System Extensions for Shared Virtual Memory Clusters

WOMPAT '01 Proceedings of the International Workshop on OpenMP Applications and Tools: OpenMP Shared Memory Parallel Programming
Efficient Categorization of Sharing Patterns in Software DSM Systems

IPDPS '01 Proceedings of the 15th International Parallel & Distributed Processing Symposium
Message passing and shared address space parallelism on an SMP cluster

Parallel Computing
Dynamic Data Replication: An Approach to Providing Fault-Tolerant Shared Memory Clusters

HPCA '03 Proceedings of the 9th International Symposium on High-Performance Computer Architecture
Evaluation of Compiler-Assisted Software DSM Schemes for a Workstation Cluster

IWIA '99 Proceedings of the 1999 International Workshop on Innovative Architecture
Parallel Tree Building on a Range of Shared Address Space Multiprocessors: Algorithms and Application Performance

IPPS '98 Proceedings of the 12th. International Parallel Processing Symposium on International Parallel Processing Symposium
Using System Emulation to Model Next-Generation Shared Virtual Memory Clusters

Cluster Computing
Shared virtual memory clusters: bridging the cost-performance gap between SMPs and hardware DSM systems

Journal of Parallel and Distributed Computing
Fast and transparent recovery for continuous availability of cluster-based servers

Proceedings of the eleventh ACM SIGPLAN symposium on Principles and practice of parallel programming
Addressing a workload characterization study to the design of consistency protocols

The Journal of Supercomputing

Quantified Score

Hi-index	0.00

Visualization

Abstract

The performance portability of parallel programs across a wide range of emerging coherent shared address space systems is not well understood. Programs that run well on efficient, hardware cache-coherent systems often do not perform well on less optimal or more commodity-based communication architectures. This paper studies this issue of performance portability, with the commodity communication architecture of interest being page-grained shared virtual memory. We begin with applications that perform well on moderat scale hardware cache-coherent systems, and find that they do not do so well on SVM systems. Then, we examine whether and how the applications can be improved for SVM systems --- through data structuring or algorithmic enhancements---and the nature and difficulty of the optimization. Finally, we examine the impact of the successful optimizations on hardware-coherent platforms themselves, to see whether they are helpful, harmful or neutral on those platforms. We develop a systematic methodology to explore optimizations in different structured classes. The results, and the difficulty of the optimizations, lead insight not only into performance portability but also into the viability of SVM as a platform for these types of applications.