Architecture and design of AlphaServer GS320

Authors:
Kourosh Gharachorloo;Madhu Sharma;Simon Steely;Stephen Van Doren
Affiliations:
Western Research Laboratory, Compaq Computer Corporation, Palo Alto, California;High Performance Servers Division, Compaq Computer Corporation Marlborough, Massachusetts;High Performance Servers Division, Compaq Computer Corporation Marlborough, Massachusetts;High Performance Servers Division, Compaq Computer Corporation Marlborough, Massachusetts
Venue:
ASPLOS IX Proceedings of the ninth international conference on Architectural support for programming languages and operating systems
Year:
2000

Citing 26
Cited 44

A lazy cache algorithm

SPAA '89 Proceedings of the first annual ACM symposium on Parallel algorithms and architectures
Race-free interconnection networks and multiprocessor consistency

ISCA '91 Proceedings of the 18th annual international symposium on Computer architecture
Delayed consistency and its effects on the miss rate of parallel programs

Proceedings of the 1991 ACM/IEEE conference on Supercomputing
Alpha architecture reference manual

Alpha architecture reference manual
Lazy caching

ACM Transactions on Programming Languages and Systems (TOPLAS)
The Stanford FLASH multiprocessor

ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Shasta: a low overhead, software-only approach for supporting fine-grain shared memory

Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
Operating system support for improving data locality on CC-NUMA compute servers

Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
Memory consistency models for shared-memory multiprocessors

Memory consistency models for shared-memory multiprocessors
Design and performance of the Shasta distributed shared memory protocol

ICS '97 Proceedings of the 11th international conference on Supercomputing
Hardware fault containment in scalable shared-memory multiprocessors

Proceedings of the 24th annual international symposium on Computer architecture
The SGI Origin: a ccNUMA highly scalable server

Proceedings of the 24th annual international symposium on Computer architecture
Disco: running commodity operating systems on scalable multiprocessors

Proceedings of the sixteenth ACM symposium on Operating systems principles
Towards transparent and efficient software distributed shared memory

Proceedings of the sixteenth ACM symposium on Operating systems principles
Memory system characterization of commercial workloads

Proceedings of the 25th annual international symposium on Computer architecture
Is SC + ILP = RC?

ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
Multicast snooping: a new coherence method using a multicast address network

ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
The directory-based cache coherence protocol for the DASH multiprocessor

ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
Piranha: a scalable architecture based on single-chip multiprocessing

Proceedings of the 27th annual international symposium on Computer architecture
Measuring memory hierarchy performance of cache-coherent multiprocessors using micro benchmarks

SC '97 Proceedings of the 1997 ACM/IEEE conference on Supercomputing
A Single-Chip Multiprocessor

Computer
The Future of Systems Research

Computer
Starfire: Extending the SMP Envelope

IEEE Micro
The Alpha 21264 Microprocessor

IEEE Micro
WildFire: A Scalable Path for SMPs

HPCA '99 Proceedings of the 5th International Symposium on High Performance Computer Architecture
The Scalable Coherent Interface (SCI)

IEEE Communications Magazine

Timestamp snooping: an approach for extending SMPs

ACM SIGPLAN Notices
Timestamp snooping: an approach for extending SMPs

ASPLOS IX Proceedings of the ninth international conference on Architectural support for programming languages and operating systems
Leveraging cache coherence in active memory systems

ICS '02 Proceedings of the 16th international conference on Supercomputing
The sun fireplane system interconnect

Proceedings of the 2001 ACM/IEEE conference on Supercomputing
A comparative study of arbitration algorithms for the Alpha 21364 pipelined router

Proceedings of the 10th international conference on Architectural support for programming languages and operating systems
The Sun Fireplane Interconnect

IEEE Micro
Checking Cache-Coherence Protocols with TLA+

Formal Methods in System Design
Speculative Sequential Consistency with Little Custom Storage

Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques
The Use of Prediction for Accelerating Upgrade Misses in cc-NUMA Multiprocessors

Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques
Owner prediction for accelerating cache-to-cache transfer misses in a cc-NUMA architecture

Proceedings of the 2002 ACM/IEEE conference on Supercomputing
Efficient synchronization for nonuniform communication architectures

Proceedings of the 2002 ACM/IEEE conference on Supercomputing
Quantifying instruction criticality for shared memory multiprocessors

Proceedings of the fifteenth annual ACM symposium on Parallel algorithms and architectures
Inferential queueing and speculative push for reducing critical communication latencies

ICS '03 Proceedings of the 17th annual international conference on Supercomputing
Hierarchical Backoff Locks for Nonuniform Communication Architectures

HPCA '03 Proceedings of the 9th International Symposium on High-Performance Computer Architecture
Memory System Behavior of Java-Based Middleware

HPCA '03 Proceedings of the 9th International Symposium on High-Performance Computer Architecture
Using destination-set prediction to improve the latency/bandwidth tradeoff in shared-memory multiprocessors

Proceedings of the 30th annual international symposium on Computer architecture
Latency, Occupancy, and Bandwidth in DSM Multiprocessors: A Performance Evaluation

IEEE Transactions on Computers
Quantifying contention and balancing memory load on hardware DSM multiprocessors

Journal of Parallel and Distributed Computing - Special section best papers from the 2002 international parallel and distributed processing symposium
The Impact of Negative Acknowledgments in Shared Memory Scientific Applications

IEEE Transactions on Parallel and Distributed Systems
Architectural Support for Uniprocessor and Multiprocessor Active Memory Systems

IEEE Transactions on Computers
Exploring Virtual Network Selection Algorithms in DSM Cache Coherence Protocols

IEEE Transactions on Parallel and Distributed Systems
A Two-Level Directory Architecture for Highly Scalable cc-NUMA Multiprocessors

IEEE Transactions on Parallel and Distributed Systems
Using Hardware Counters to Automatically Improve Memory Performance

Proceedings of the 2004 ACM/IEEE conference on Supercomputing
Cache coherence support for non-shared bus architecture on heterogeneous MPSoCs

Proceedings of the 42nd annual Design Automation Conference
Microarchitecture of a High-Radix Router

Proceedings of the 32nd annual international symposium on Computer Architecture
The architecture of the HP Superdome shared-memory multiprocessor

Proceedings of the 19th annual international conference on Supercomputing
Reducing Server Data Traffic Using a Hierarchical Computation Model

IEEE Transactions on Parallel and Distributed Systems
Formal Verification and its Impact on the Snooping versus Directory Protocol Debate

ICCD '05 Proceedings of the 2005 International Conference on Computer Design
Inferential queueing and speculative push

International Journal of Parallel Programming - Special issue I: The 17th annual international conference on supercomputing (ICS'03)
Specifying and verifying systems with TLA+

EW 10 Proceedings of the 10th workshop on ACM SIGOPS European workshop
Conditional Memory Ordering

Proceedings of the 33rd annual international symposium on Computer Architecture
Coherence Ordering for Ring-based Chip Multiprocessors

Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Scaling non-regular shared-memory codes by reusing custom loop schedules

Scientific Programming - OpenMP
Virtual tree coherence: Leveraging regions and in-network multicast trees for scalable cache coherence

Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture
A two-level directory organization solution for CC-NUMA systems

ICA3PP'07 Proceedings of the 7th international conference on Algorithms and architectures for parallel processing
Efficient methods for formally verifying safety properties of hierarchical cache coherence protocols

Formal Methods in System Design
Token tenure and PATCH: A predictive/adaptive token-counting hybrid

ACM Transactions on Architecture and Code Optimization (TACO)
Fractal Coherence: Scalably Verifiable Cache Coherence

MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
A case for globally shared-medium on-chip interconnect

Proceedings of the 38th annual international symposium on Computer architecture
Speeding-up synchronizations in DSM multiprocessors

Euro-Par'06 Proceedings of the 12th international conference on Parallel Processing
A novel lightweight directory architecture for scalable shared-memory multiprocessors

Euro-Par'05 Proceedings of the 11th international Euro-Par conference on Parallel Processing
Enhancing effective throughput for transmission line-based bus

Proceedings of the 39th Annual International Symposium on Computer Architecture
Using in-flight chains to build a scalable cache coherence protocol

ACM Transactions on Architecture and Code Optimization (TACO)
Exploiting replication to improve performances of NUCA-based CMP systems

ACM Transactions on Embedded Computing Systems (TECS) - Special Issue on Design Challenges for Many-Core Processors, Special Section on ESTIMedia'13 and Regular Papers

Quantified Score

Hi-index	0.01

Visualization

Abstract

This paper describes the architecture and implementation of the AlphaServer GS320, a cache-coherent non-uniform memory access multiprocessor developed at Compaq. The AlphaServer GS320 architecture is specifically targeted at medium-scale multiprocessing with 32 to 64 processors. Each node in the design consists of four Alpha 21264 processors, up to 32GB of coherent memory, and an aggressive IO subsystem. The current implementation supports up to 8 such nodes for a total of 32 processors. While snoopy-based designs have been stretched to medium-scale multiprocessors by some vendors, providing sufficient snoop bandwidth remains a major challenge especially in systems with aggressive processors. At the same time, directory protocols targeted at larger scale designs lead to a number of inherent inefficiencies relative to snoopy designs. A key goal of the AlphaServer GS320 architecture has been to achieve the best-of-both-worlds, partly by exploiting the bounded scale of the target systems.This paper focuses on the unique design features used in the AlphaServer GS320 to efficiently implement coherence and consistency. The guiding principle for our directory-based protocol is to address correctness issues related to rare protocol races without burdening the common transaction flows. Our protocol exhibits lower occupancy and lower message counts compared to previous designs, and provides more efficient handling of 3-hop transactions. Furthermore, our design naturally lends itself to elegant solutions for deadlock, livelock, starvation, and fairness. The AlphaServer GS320 architecture also incorporates a couple of innovative techniques that extend previous approaches for efficiently implementing memory consistency models. These techniques allow us to generate commit events (which are used for ordering purposes) well in advance of formulating the reply to a transaction. Furthermore, the separation of the commit event allows time-critical replies to by-pass inbound requests without violating ordering properties. Even though our design specifically targets medium-scale servers, many of the same techniques can be applied to larger-scale directory-based and smaller-scale snoopy-based designs. Finally, we evaluate the performance impact of some of the above optimizations and present a few competitive benchmark results.