The performance advantages of integrating block data transfer in cache-coherent multiprocessors

Authors:
Steven Cameron Woo;Jaswinder Pal Singh;John L. Hennessy
Affiliations:
Computer Systems Laboratory, Stanford University, Stanford, CA;Computer Systems Laboratory, Stanford University, Stanford, CA;Computer Systems Laboratory, Stanford University, Stanford, CA
Venue:
ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
Year:
1994

Citing 13
Cited 23

FFTs in external or hierarchical memory

The Journal of Supercomputing
A comparison of sorting algorithms for the connection machine CM-2

SPAA '91 Proceedings of the third annual ACM symposium on Parallel algorithms and architectures
Integrating message-passing and shared-memory: early experience

PPOPP '93 Proceedings of the fourth ACM SIGPLAN symposium on Principles and practice of parallel programming
Simulation of multiprocessors: accuracy and performance

Simulation of multiprocessors: accuracy and performance
Evaluation of release consistent software distributed shared memory on emerging network technology

ISCA '93 Proceedings of the 20th annual international symposium on computer architecture
The Stanford FLASH multiprocessor

ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Tempest and typhoon: user-level shared memory

ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Integration of message passing and shared memory in the Stanford FLASH multiprocessor

ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
APRIL: a processor architecture for multiprocessing

ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
The directory-based cache coherence protocol for the DASH multiprocessor

ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
Scaling Parallel Programs for Multiprocessors: Methodology and Examples

Computer
SPLASH: Stanford parallel applications for shared-memory

SPLASH: Stanford parallel applications for shared-memory
The Performance Advantages of Integrating Message Passing in Cache-Coherent Multiprocessors

The Performance Advantages of Integrating Message Passing in Cache-Coherent Multiprocessors

The performance impact of flexibility in the Stanford FLASH multiprocessor

ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
The SPLASH-2 programs: characterization and methodological considerations

ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Architectural mechanisms for explicit communication in shared memory multiprocessors

Supercomputing '95 Proceedings of the 1995 ACM/IEEE conference on Supercomputing
Speeding up irregular applications in shared-memory multiprocessors: memory binding and group prefetching

ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Application and architectural bottlenecks in large scale distributed shared memory machines

ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
An integrated compile-time/run-time software distributed shared memory system

Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
SoftFLASH: analyzing the performance of clustered distributed virtual shared memory

Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
Performance implications of communication mechanisms in all-software global address space systems

PPOPP '97 Proceedings of the sixth ACM SIGPLAN symposium on Principles and practice of parallel programming
Hardware Support for Flexible Distributed Shared Memory

IEEE Transactions on Computers
Using network interface support to avoid asynchronous protocol processing in shared virtual memory systems

ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
A comparison of MPI, SHMEM and cache-coherent shared address space programming models on the SGI Origin2000

ICS '99 Proceedings of the 13th international conference on Supercomputing
Reducing coherence overhead of barrier synchronization in software DSMs

SC '98 Proceedings of the 1998 ACM/IEEE conference on Supercomputing
Integrating non-blocking synchronisation in parallel applications: performance advantages and methodologies

WOSP '02 Proceedings of the 3rd international workshop on Software and performance
A Comparison of MPI, SHMEM and Cache-Coherent Shared Address Space Programming Models on a Tightly-Coupled Multiprocessors

International Journal of Parallel Programming
Coherent Block Data Transfer in the FLASH Multiprocessor

IPPS '97 Proceedings of the 11th International Symposium on Parallel Processing
A Multiprotocol Communication Support for the Global Address Space Programming Model on the IBM SP

Euro-Par '00 Proceedings from the 6th International Euro-Par Conference on Parallel Processing
CableS: Thread Control and Memory System Extensions for Shared Virtual Memory Clusters

WOMPAT '01 Proceedings of the International Workshop on OpenMP Applications and Tools: OpenMP Shared Memory Parallel Programming
Latency, Occupancy, and Bandwidth in DSM Multiprocessors: A Performance Evaluation

IEEE Transactions on Computers
Shared virtual memory clusters: bridging the cost-performance gap between SMPs and hardware DSM systems

Journal of Parallel and Distributed Computing
Synchronization coherence: A transparent hardware mechanism for cache coherence and fine-grained synchronization

Journal of Parallel and Distributed Computing
Process scheduling for future multicore processors

Proceedings of the Fifth International Workshop on Interconnection Network Architecture: On-Chip, Multi-Chip
A minimal average accessing time scheduler for multicore processors

ICA3PP'11 Proceedings of the 11th international conference on Algorithms and architectures for parallel processing - Volume Part II
Optimized multicore architectures for data parallel fast Fourier transform

Proceedings of the 14th International Conference on Computer Systems and Technologies

Abstract

Integrating support for block data transfer has become an important emphasis in recent cache-coherent shared address space multiprocessors. This paper examines the potential performance benefits of adding this support. A set of ambitious hardware mechanisms is used to study performance gains in five important scientific computations that appear to be good candidates for using block transfer. Our conclusion is that the benefits of block transfer are not substantial for hardware cache-coherent multiprocessors. The main reasons for this are (i) the relatively modest fraction of time applications spend in communication amenable to block transfer, (ii) the difficulty of finding enough independent computation to overlap with the communication latency that remains after block transfer, and (iii) long cache lines often capture many of the benefits of block transfer in efficient cache-coherent machines. In the cases where block transfer improves performance, prefetching can often provide comparable, if not superior, performance benefits. We also examine the impact of varying important communication parameters and processor speed on the effectiveness of block transfer, and comment on useful features that a block transfer facility should support for real applications.
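
Reason (i) in the abstract can be read as an Amdahl-style bound: if only a modest fraction of run time is communication that block transfer can accelerate, the overall gain is capped no matter how fast the transfer mechanism is. The sketch below illustrates that reading only. The function name, its parameters, and the 20% fraction used in the example are hypothetical and are not taken from the paper.

```python
def block_transfer_speedup_bound(comm_fraction: float, comm_speedup: float) -> float:
    """Amdahl-style upper bound on whole-program speedup.

    comm_fraction: assumed share of run time spent in communication that is
                   amenable to block transfer (hypothetical input, 0..1).
    comm_speedup:  assumed factor by which block transfer accelerates that
                   share (hypothetical input, > 0; float('inf') is allowed).
    """
    if not 0.0 <= comm_fraction <= 1.0:
        raise ValueError("comm_fraction must be between 0 and 1")
    if comm_speedup <= 0.0:
        raise ValueError("comm_speedup must be positive")
    # Time not affected by block transfer stays the same; the rest shrinks.
    remaining_time = (1.0 - comm_fraction) + comm_fraction / comm_speedup
    return 1.0 / remaining_time


if __name__ == "__main__":
    # Illustrative numbers only: 20% of time in amenable communication.
    for factor in (2.0, 10.0, float("inf")):
        bound = block_transfer_speedup_bound(0.2, factor)
        print(f"communication {factor}x faster -> overall bound {bound:.3f}x")
```

Under these assumed numbers, even an infinitely fast transfer of the amenable communication cannot push the whole-program speedup past 1.25x, which is consistent in spirit with the abstract's conclusion that the benefits are not substantial.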