This paper describes the architecture and implementation of the AlphaServer GS320, a cache-coherent non-uniform memory access multiprocessor developed at Compaq. The AlphaServer GS320 architecture is specifically targeted at medium-scale multiprocessing with 32 to 64 processors. Each node in the design consists of four Alpha 21264 processors, up to 32 GB of coherent memory, and an aggressive I/O subsystem. The current implementation supports up to 8 such nodes for a total of 32 processors. While snoopy-based designs have been stretched to medium-scale multiprocessors by some vendors, providing sufficient snoop bandwidth remains a major challenge, especially in systems with aggressive processors. At the same time, directory protocols targeted at larger-scale designs lead to a number of inherent inefficiencies relative to snoopy designs. A key goal of the AlphaServer GS320 architecture has been to achieve the best of both worlds, partly by exploiting the bounded scale of the target systems.

This paper focuses on the unique design features used in the AlphaServer GS320 to efficiently implement coherence and consistency. The guiding principle for our directory-based protocol is to address correctness issues related to rare protocol races without burdening the common transaction flows. Our protocol exhibits lower occupancy and lower message counts compared to previous designs, and provides more efficient handling of 3-hop transactions. Furthermore, our design naturally lends itself to elegant solutions for deadlock, livelock, starvation, and fairness. The AlphaServer GS320 architecture also incorporates a couple of innovative techniques that extend previous approaches for efficiently implementing memory consistency models. These techniques allow us to generate commit events (which are used for ordering purposes) well in advance of formulating the reply to a transaction. Furthermore, the separation of the commit event allows time-critical replies to bypass inbound requests without violating ordering properties. Even though our design specifically targets medium-scale servers, many of the same techniques can be applied to larger-scale directory-based and smaller-scale snoopy-based designs. Finally, we evaluate the performance impact of some of the above optimizations and present a few competitive benchmark results.