The Impact of Negative Acknowledgments in Shared Memory Scientific Applications

Authors:
Mainak Chaudhuri;Mark Heinrich
Affiliations:
-;-
Venue:
IEEE Transactions on Parallel and Distributed Systems
Year:
2004

Citing 23
Cited 2

Efficient synchronization primitives for large-scale cache-coherent multiprocessors

ASPLOS III Proceedings of the third international conference on Architectural support for programming languages and operating systems
LimitLESS directories: A scalable cache coherence scheme

ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
The Stanford Dash Multiprocessor

Computer
DDM: A Cache-Only Memory Architecture

Computer
Cache coherence directories for scalable multiprocessors

Cache coherence directories for scalable multiprocessors
The Stanford FLASH multiprocessor

ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
The performance impact of flexibility in the Stanford FLASH multiprocessor

ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
The SPLASH-2 programs: characterization and methodological considerations

ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
STiNG: a CC-NUMA computer system for the commercial marketplace

ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
Efficient synchronization: let them eat QOLB

Proceedings of the 24th annual international symposium on Computer architecture
The SGI Origin: a ccNUMA highly scalable server

Proceedings of the 24th annual international symposium on Computer architecture
Memory system characterization of commercial workloads

Proceedings of the 25th annual international symposium on Computer architecture
Performance of database workloads on shared-memory systems with out-of-order processors

Proceedings of the eighth international conference on Architectural support for programming languages and operating systems
A Quantitative Analysis of the Performance and Scalability of Distributed Shared Memory Cache Coherence Protocols

IEEE Transactions on Computers - Special issue on cache memory and related problems
The directory-based cache coherence protocol for the DASH multiprocessor

ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
Piranha: a scalable architecture based on single-chip multiprocessing

Proceedings of the 27th annual international symposium on Computer architecture
Architecture and design of AlphaServer GS320

ASPLOS IX Proceedings of the ninth international conference on Architectural support for programming languages and operating systems
FLASH vs. (Simulated) FLASH: closing the simulation loop

ASPLOS IX Proceedings of the ninth international conference on Architectural support for programming languages and operating systems
Spider: A High-Speed Network Interconnect

IEEE Micro
WildFire: A Scalable Path for SMPs

HPCA '99 Proceedings of the 5th International Symposium on High Performance Computer Architecture
Ocean warning: avoid drowning

ACM SIGARCH Computer Architecture News
The performance and scalability of distributed shared-memory cache coherence protocols

The performance and scalability of distributed shared-memory cache coherence protocols
Latency, Occupancy, and Bandwidth in DSM Multiprocessors: A Performance Evaluation

IEEE Transactions on Computers

Exploring Virtual Network Selection Algorithms in DSM Cache Coherence Protocols

IEEE Transactions on Parallel and Distributed Systems
Speeding-up synchronizations in DSM multiprocessors

Euro-Par'06 Proceedings of the 12th international conference on Parallel Processing

Quantified Score

Hi-index	0.00

Visualization

Abstract

Abstract--Negative ACKnowledgments (NACKs) and subsequent retries, used to resolve races and to enforce a total order among shared memory accesses in distributed shared memory (DSM) multiprocessors, not only introduce extra network traffic and contention, but also increase node controller occupancy, especially at the home. In this paper, we present possible protocol optimizations to minimize these retries and offer a thorough study of the performance effects of these messages on six scalable scientific applications running on 64-node systems and larger. To eliminate NACKs, we present a mechanism to queue pending requests at the main memory of the home node and augment it with a novel technique of combining pending read requests, thereby accelerating the parallel execution for 64 nodes by as much as 41 percent (a speedup of 1.41) compared to a modified version of the SGI Origin 2000 protocol. We further design and evaluate a protocol by combining this mechanism with a technique that we call write string forwarding, used in the AlphaServer GS320 and Piranha systems. We find that without careful design considerations, especially regarding atomic read-modify-write operations, this aggressive write forwarding can hurt performance. We identify and evaluate the necessary micro-architectural support to solve this problem. We compare the performance of these novel NACK-free protocols with a base bitvector protocol, a modified version of the SGI Origin 2000 protocol, and a NACK-free protocol that uses dirty sharing and write string forwarding as in the Piranha system. To understand the effects of network speed and topology the evaluation is carried out on three network configurations.