Application and architectural bottlenecks in large scale distributed shared memory machines

Authors:
Chris Holt;Jaswinder Pal Singh;John Hennessy
Affiliations:
Computer Systems Laboratory, Stanford University, Stanford, CA;Department of Computer Science, Princeton University, Princeton, NJ;Computer Systems Laboratory, Stanford University, Stanford, CA
Venue:
ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
Year:
1996

Citing 13
Cited 17

Tolerating latency through software-controlled prefetching in shared-memory multiprocessors

Journal of Parallel and Distributed Computing - Special issue on shared-memory multiprocessors
The Stanford Dash Multiprocessor

Computer
The DASH prototype: implementation and performance

ISCA '92 Proceedings of the 19th annual international symposium on Computer architecture
Simulation of multiprocessors: accuracy and performance

Simulation of multiprocessors: accuracy and performance
Working sets, cache sizes, and node granularity issues for large-scale multiprocessors

ISCA '93 Proceedings of the 20th annual international symposium on computer architecture
The Stanford FLASH multiprocessor

ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Tempest and typhoon: user-level shared memory

ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
The performance advantages of integrating block data transfer in cache-coherent multiprocessors

ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
The performance impact of flexibility in the Stanford FLASH multiprocessor

ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
Fine-grain access control for distributed shared memory

ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
The MIT Alewife machine: architecture and performance

ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Scaling Parallel Programs for Multiprocessors: Methodology and Examples

Computer
The Effects of Latency, Occupancy, and Bandwidth in Distributed Shared Memory Multiprocessors

The Effects of Latency, Occupancy, and Bandwidth in Distributed Shared Memory Multiprocessors

The interaction of parallel programming constructs and coherence protocols

PPOPP '97 Proceedings of the sixth ACM SIGPLAN symposium on Principles and practice of parallel programming
Application restructuring and performance portability on shared virtual memory and hardware-coherent multiprocessors

PPOPP '97 Proceedings of the sixth ACM SIGPLAN symposium on Principles and practice of parallel programming
A methodology and an evaluation of the SGI Origin2000

SIGMETRICS '98/PERFORMANCE '98 Proceedings of the 1998 ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems
Analytic evaluation of shared-memory systems with ILP processors

Proceedings of the 25th annual international symposium on Computer architecture
Evaluating the Effect of Coherence Protocols on the Performance of Parallel Programming Constructs

International Journal of Parallel Programming
Scaling application performance on a cache-coherent multiprocessor

ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
Application scaling under shared virtual memory on a cluster of SMPs

ICS '99 Proceedings of the 13th international conference on Supercomputing
Quantitative Characterization and Analysis of the I/O Behavior of a Commercial Distributed-Shared-Memory Machine

IEEE Transactions on Parallel and Distributed Systems
Accelerating shared virtual memory via general-purpose network interface support

ACM Transactions on Computer Systems (TOCS)
Is data distribution necessary in OpenMP?

Proceedings of the 2000 ACM/IEEE conference on Supercomputing
Runtime vs. Manual Data Distribution for Architecture-Agnostic Shared-Memory Programming Models

International Journal of Parallel Programming
How Much Does Network Contention Affect Distributed Shared Memory Performance?

ICPP '97 Proceedings of the international Conference on Parallel Processing
Leveraging Transparent Data Distribution in OpenMP via User-Level Dynamic Page Migration

ISHPC '00 Proceedings of the Third International Symposium on High Performance Computing
Virtual memory on data diffusion architectures

Parallel Computing
A transparent runtime data distribution engine for OpenMP

Scientific Programming
A case study of trace-driven simulation for analyzing interconnection networks: cc-NUMAs with ILP processors

EURO-PDP'00 Proceedings of the 8th Euromicro conference on Parallel and distributed processing
Dual-layered file cache on cc-NUMA system

IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing

Quantified Score

Hi-index	0.00

Visualization

Abstract

Many of the programming challenges encountered in small to moderate-scale hardware cache-coherent shared memory machines have been extensively studied. While work remains to be done, the basic techniques needed to efficiently program such machines have been well explored. Recently, a number of researchers have presented architectural techniques for scaling a cache coherent shared address space to much larger processor counts. In this paper, we examine the extent to which applications can achieve reasonable performance on such large-scale, cache-coherent, distributed shared address space machines, by determining the problems sizes needed to achieve a reasonable level of efficiency. We also look at how much programming effort and optimization is needed to achieve high efficiency, beyond that needed at small processor counts. For each application, we discuss the main architectural bottlenecks that prevent smaller problem sizes or less optimized programs from achieving good efficiency. Our results show that while there are some applications that either do not scale or must be heavily optimized to do so, for most of the applications we studied it is not necessary to heavily modify the code or restructure algorithms to scale well upto several hundred processors, once the basic techniques for load balancing and data locality are used that are needed for small-scale systems as well. Programs written with some care perform well without substantially compromising the ease of programming advantage of a shared address space, and the problem sizes required to achieve good performance are surprisingly small. It is important to be careful about how data structures and layouts interact with system granularities, but these optimizations are usually needed for moderate-scale machines as well.