The SGI Origin: a ccNUMA highly scalable server

Authors:
James Laudon;Daniel Lenoski
Affiliations:
Silicon Graphics, Inc., 2011 North Shoreline Boulevard, Mountain View, California;Silicon Graphics, Inc., 2011 North Shoreline Boulevard, Mountain View, California
Venue:
Proceedings of the 24th annual international symposium on Computer architecture
Year:
1997

Citing 12
Cited 288

Tempest and typhoon: user-level shared memory

ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
The performance impact of flexibility in the Stanford FLASH multiprocessor

ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
The MIT Alewife machine: architecture and performance

ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
STiNG: a CC-NUMA computer system for the commercial marketplace

ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
Implementing efficient fault containment for multiprocessors: confining faults in a shared-memory multiprocessor environment

Communications of the ACM
Operating system support for improving data locality on CC-NUMA compute servers

Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
The directory-based cache coherence protocol for the DASH multiprocessor

ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
The MIPS R10000 Superscalar Microprocessor

IEEE Micro
The DASH Prototype: Logic Overhead and Performance

IEEE Transactions on Parallel and Distributed Systems
Using Formal Verification/Analysis Methods on the Critical Path in System Design: A Case Study

Proceedings of the 7th International Conference on Computer Aided Verification
The SGI Origin Software Environment and Application Performance

COMPCON '97 Proceedings of the 42nd IEEE International Computer Conference
The evolution of the HP/Convex Exemplar

COMPCON '97 Proceedings of the 42nd IEEE International Computer Conference

Data distribution support on distributed shared memory multiprocessors

Proceedings of the ACM SIGPLAN 1997 conference on Programming language design and implementation
Improving parallel shear-warp volume rendering on shared address space multiprocessors

PPOPP '97 Proceedings of the sixth ACM SIGPLAN symposium on Principles and practice of parallel programming
Hardware fault containment in scalable shared-memory multiprocessors

Proceedings of the 24th annual international symposium on Computer architecture
Disco: running commodity operating systems on scalable multiprocessors

ACM Transactions on Computer Systems (TOCS)
Towards efficient parallel radiosity for DSM-based parallel computers using virtual interfaces

PRS '97 Proceedings of the IEEE symposium on Parallel rendering
Disco: running commodity operating systems on scalable multiprocessors

Proceedings of the sixteenth ACM symposium on Operating systems principles
Tolerating latency in multiprocessors through compiler-inserted prefetching

ACM Transactions on Computer Systems (TOCS)
Lamport clocks: verifying a directory cache-coherence protocol

Proceedings of the tenth annual ACM symposium on Parallel algorithms and architectures
In-memory directories: eliminating the cost of directories in CC-NUMAs

Proceedings of the tenth annual ACM symposium on Parallel algorithms and architectures
A methodology and an evaluation of the SGI Origin2000

SIGMETRICS '98/PERFORMANCE '98 Proceedings of the 1998 ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems
Kernel-level scheduling for the nano-threads programming model

ICS '98 Proceedings of the 12th international conference on Supercomputing
The hierarchical organization of molecular structure computations

RECOMB '98 Proceedings of the second annual international conference on Computational molecular biology
Using prediction to accelerate coherence protocols

Proceedings of the 25th annual international symposium on Computer architecture
Effects of architectural and technological advances on the HP/Convex Exemplar's memory and communication performance

Proceedings of the 25th annual international symposium on Computer architecture
Flexible use of memory for replication/migration in cache-coherent DSM multiprocessors

Proceedings of the 25th annual international symposium on Computer architecture
Analytic evaluation of shared-memory systems with ILP processors

Proceedings of the 25th annual international symposium on Computer architecture
The design of a parallel graphics interface

Proceedings of the 25th annual conference on Computer graphics and interactive techniques
Prefetching in a texture cache architecture

HWWS '98 Proceedings of the ACM SIGGRAPH/EUROGRAPHICS workshop on Graphics hardware
Hardware Support for Flexible Distributed Shared Memory

IEEE Transactions on Computers
VISA: Netstation's virtual Internet SCSI adapter

Proceedings of the eighth international conference on Architectural support for programming languages and operating systems
A Quantitative Analysis of the Performance and Scalability of Distributed Shared Memory Cache Coherence Protocols

IEEE Transactions on Computers - Special issue on cache memory and related problems
The Impact of Exploiting Instruction-Level Parallelism on Shared-Memory Multiprocessors

IEEE Transactions on Computers - Special issue on cache memory and related problems
Exploiting the Benefits of Multiple-Path Network in DSM Systems: Architectural Alternatives and Performance Evaluation

IEEE Transactions on Computers - Special issue on cache memory and related problems
A Linear Algebra Framework for Automatic Determination of Optimal Data Layouts

IEEE Transactions on Parallel and Distributed Systems
Memory sharing predictor: the key to a speculative coherent DSM

ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
Multicast snooping: a new coherence method using a multicast address network

ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
Scaling application performance on a cache-coherent multiprocessor

ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
MagPIe: MPI's collective communication operations for clustered wide area systems

Proceedings of the seventh ACM SIGPLAN symposium on Principles and practice of parallel programming
Performance prediction of large parallel applications using parallel simulations

Proceedings of the seventh ACM SIGPLAN symposium on Principles and practice of parallel programming
Evaluating synchronization on shared address space multiprocessors: methodology and performance

SIGMETRICS '99 Proceedings of the 1999 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Resource Scaling Effects on MPP Performance: The STAP Benchmark Implications

IEEE Transactions on Parallel and Distributed Systems
Low-level router design and its impact on supercomputer system performance

ICS '99 Proceedings of the 13th international conference on Supercomputing
Improving the performance of bristled CC-NUMA systems using virtual channels and adaptivity

ICS '99 Proceedings of the 13th international conference on Supercomputing
Thread fork/join techniques for multi-level parallelism exploitation in NUMA multiprocessors

ICS '99 Proceedings of the 13th international conference on Supercomputing
A quantitative architectural evaluation of synchronization algorithms and disciplines on ccNUMA systems: the case of the SGI Origin2000

ICS '99 Proceedings of the 13th international conference on Supercomputing
Comparing the memory system performance of the HP V-class and SGI Origin 2000 multiprocessors using microbenchmarks and scientific applications

ICS '99 Proceedings of the 13th international conference on Supercomputing
Optimal replacements in caches with two miss costs

Proceedings of the eleventh annual ACM symposium on Parallel algorithms and architectures
Verifying large-scale multiprocessors using an abstract verification environment

Proceedings of the 36th annual ACM/IEEE Design Automation Conference
Cellular Disco: resource management using virtual clusters on shared-memory multiprocessors

Proceedings of the seventeenth ACM symposium on Operating systems principles
Overlapping multi-processing and graphics hardware acceleration: performance evaluation

PVGS '99 Proceedings of the 1999 IEEE symposium on Parallel visualization and graphics
A Testbed for Evaluation of Fault-Tolerant Routing in Multiprocessor Interconnection Networks

IEEE Transactions on Parallel and Distributed Systems
The effect of state-saving in optimistic simulation on a cache-coherent non-uniform memory access architecture

Proceedings of the 31st conference on Winter simulation: Simulation---a bridge to the future - Volume 2
Scal-Tool: pinpointing and quantifying scalability bottlenecks in DSM multiprocessors

SC '99 Proceedings of the 1999 ACM/IEEE conference on Supercomputing
Performance experiences on Sun's Wildfire prototype

SC '99 Proceedings of the 1999 ACM/IEEE conference on Supercomputing
Improving parallel system performance by changing the arrangement of the network links

Proceedings of the 14th international conference on Supercomputing
A case for user-level dynamic page migration

Proceedings of the 14th international conference on Supercomputing
A scalable approach to thread-level speculation

Proceedings of the 27th annual international symposium on Computer architecture
Selective, accurate, and timely self-invalidation using last-touch prediction

Proceedings of the 27th annual international symposium on Computer architecture
Piranha: a scalable architecture based on single-chip multiprocessing

Proceedings of the 27th annual international symposium on Computer architecture
Comparing the effectiveness of fine-grain memory caching against page migration/replication in reducing traffic in DSM clusters

Proceedings of the twelfth annual ACM symposium on Parallel algorithms and architectures
Quantitative Characterization and Analysis of the I/O Behavior of a Commercial Distributed-Shared-Memory Machine

IEEE Transactions on Parallel and Distributed Systems
Multigrain shared memory

ACM Transactions on Computer Systems (TOCS)
How to vectorize the algebraic multilevel iteration

ACM Transactions on Mathematical Software (TOMS) - Special issue in honor of John Rice's 65th birthday
Design and Evaluation of a Switch Cache Architecture for CC-NUMA Multiprocessors

IEEE Transactions on Computers
Cellular Disco: resource management using virtual clusters on shared-memory multiprocessors

ACM Transactions on Computer Systems (TOCS)
Designing computer systems with MEMS-based storage

ACM SIGPLAN Notices
Architecture and design of AlphaServer GS320

ACM SIGPLAN Notices
Timestamp snooping: an approach for extending SMPs

ACM SIGPLAN Notices
Data Locality Exploitation in the Decomposition of Regular Domain Problems

IEEE Transactions on Parallel and Distributed Systems
Accelerating shared virtual memory via general-purpose network interface support

ACM Transactions on Computer Systems (TOCS)
Improving fine-grained irregular shared-memory benchmarks by data reordering

Proceedings of the 2000 ACM/IEEE conference on Supercomputing
Hardware prediction for data coherency of scientific codes on DSM

Proceedings of the 2000 ACM/IEEE conference on Supercomputing
Is data distribution necessary in OpenMP?

Proceedings of the 2000 ACM/IEEE conference on Supercomputing
Exploiting Network Locality for CC-NUMA Multiprocessors

The Journal of Supercomputing
An Analytical Model of Adaptive Wormhole Routing in Hypercubes in the Presence of Hot Spot Traffic

IEEE Transactions on Parallel and Distributed Systems
Optimistic simulation of parallel message-passing applications

Proceedings of the fifteenth workshop on Parallel and distributed simulation
Compiler-based I/O prefetching for out-of-core applications

ACM Transactions on Computer Systems (TOCS)
The trade-off between implicit and explicit data distribution in shared-memory programming paradigms

ICS '01 Proceedings of the 15th international conference on Supercomputing
A network of cellular automata for a landslide simulation

ICS '01 Proceedings of the 15th international conference on Supercomputing
The Formal Design of 1M-gate ASICs

Formal Methods in System Design - Special issue on formal methods for computer-aided design
A simple, fast and scalable non-blocking concurrent FIFO queue for shared memory multiprocessor systems

Proceedings of the thirteenth annual ACM symposium on Parallel algorithms and architectures
Designing computer systems with MEMS-based storage

ASPLOS IX Proceedings of the ninth international conference on Architectural support for programming languages and operating systems
Architecture and design of AlphaServer GS320

ASPLOS IX Proceedings of the ninth international conference on Architectural support for programming languages and operating systems
Timestamp snooping: an approach for extending SMPs

ASPLOS IX Proceedings of the ninth international conference on Architectural support for programming languages and operating systems
Fault-Tolerant Routing in Hypercube Multicomputers Using Local Safety Information

IEEE Transactions on Parallel and Distributed Systems
Asynchrony in parallel computing: from dataflow to multithreading

Progress in computer research
A General Theory for Deadlock-Free Adaptive Routing Using a Mixed Set of Resources

IEEE Transactions on Parallel and Distributed Systems
ADir_pNB: A Cost-Effective Way to Implement Full Map Directory-Based Cache Coherence Protocols

IEEE Transactions on Computers
Reducing coherence overhead of barrier synchronization in software DSMs

SC '98 Proceedings of the 1998 ACM/IEEE conference on Supercomputing
Measuring memory hierarchy performance of cache-coherent multiprocessors using micro benchmarks

SC '97 Proceedings of the 1997 ACM/IEEE conference on Supercomputing
Leveraging cache coherence in active memory systems

ICS '02 Proceedings of the 16th international conference on Supercomputing
ReVive: cost-effective architectural support for rollback recovery in shared-memory multiprocessors

ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
Speculative lock elision: enabling highly concurrent multithreaded execution

Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
Register tiling in nonrectangular iteration spaces

ACM Transactions on Programming Languages and Systems (TOPLAS)
Integrating non-blocking synchronisation in parallel applications: performance advantages and methodologies

WOSP '02 Proceedings of the 3rd international workshop on Software and performance
A Performance Model of Adaptive Wormhole Routing in k-Ary n-Cubes in the Presence of Digit-Reversal Traffic

The Journal of Supercomputing
A simulation-based method for the verification of shared memory in multiprocessor systems

Proceedings of the 2001 IEEE/ACM international conference on Computer-aided design
An Application-Driven Study of Multicast Communication for Write Invalidation

The Journal of Supercomputing
The Architectural and Operating System Implications on the Performance of Synchronization on ccNUMA Multiprocessors

International Journal of Parallel Programming
Runtime vs. Manual Data Distribution for Architecture-Agnostic Shared-Memory Programming Models

International Journal of Parallel Programming
Design and analysis of static memory management policies for CC-NUMA Multiprocessors

Journal of Systems Architecture: the EUROMICRO Journal
A Perceptually-Driven Parallel Algorithm for Efficient Radiosity Simulation

IEEE Transactions on Visualization and Computer Graphics
Strategies for Adopting FVTD on Multicomputers

Computing in Science and Engineering
Analytic Evaluation of Shared-Memory Architectures

IEEE Transactions on Parallel and Distributed Systems
Shared Virtual Memory Clusters with Next-Generation Interconnection Networks and Wide Compute Nodes

HiPC '01 Proceedings of the 8th International Conference on High Performance Computing
Using Loop-Level Parallelism to Parallelize Vectorizable Programs

HIPS '01 Proceedings of the 6th International Workshop on High-Level Parallel Programming Models and Supportive Environments
On Message-Dependent Deadlocks in Multiprocessor/Multicomputer Systems

HiPC '00 Proceedings of the 7th International Conference on High Performance Computing
Quantifying and Resolving Remote Memory Access Contention on Hardware DSM Multiprocessors

IPDPS '02 Proceedings of the 16th International Parallel and Distributed Processing Symposium
Comparing the Memory System Performance of DSS Workloads on the HP V-Class and SGI Origin 2000

IPDPS '02 Proceedings of the 16th International Parallel and Distributed Processing Symposium
A Novel Approach to Reduce L2 Miss Latency in Shared-Memory Multiprocessors

IPDPS '02 Proceedings of the 16th International Parallel and Distributed Processing Symposium
Performance Analysis of a CC-NUMA Operating System

IPDPS '01 Proceedings of the 15th International Parallel & Distributed Processing Symposium
Predicting Scalability of Parallel Garbage Collectors on Shared Memory Multiprocessors

IPDPS '01 Proceedings of the 15th International Parallel & Distributed Processing Symposium
Efficient Handling of Message-Dependent Deadlock

IPDPS '01 Proceedings of the 15th International Parallel & Distributed Processing Symposium
The Use of Prediction for Accelerating Upgrade Misses in cc-NUMA Multiprocessors

Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques
The Formal Design of 1M-gate ASICs

FMCAD '98 Proceedings of the Second International Conference on Formal Methods in Computer-Aided Design
A Tool to Schedule Parallel Applications on Multiprocessors: The NANOS CPU MANAGER

IPDPS '00/JSSPP '00 Proceedings of the Workshop on Job Scheduling Strategies for Parallel Processing
Exploring Multi-level Parallelism in Cellular Automata Networks

ISHPC '00 Proceedings of the Third International Symposium on High Performance Computing
Impact of PE Mapping on Cray T3E Message-Passing Performance

Euro-Par '00 Proceedings from the 6th International Euro-Par Conference on Parallel Processing
Message Passing Evaluation and Analysis on Cray T3E and SGI Origin 2000 Systems

Euro-Par '99 Proceedings of the 5th International Euro-Par Conference on Parallel Processing
UPMLIB: A Runtime System for Tuning the Memory Performance of OpenMP Programs on Scalable Shared-Memory Multiprocessors

LCR '00 Selected Papers from the 5th International Workshop on Languages, Compilers, and Run-Time Systems for Scalable Computers
Working with MPI Benchmarking Suites on ccNUMA Architectures

Proceedings of the 7th European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface
Processor Mechanisms for Software Shared Memory

ISHPC '00 Proceedings of the Third International Symposium on High Performance Computing
Parallel Management of Large Dynamic Shared Memory Space: A Hierarchical FEM Application

IPDPS '00 Proceedings of the 15 IPDPS 2000 Workshops on Parallel and Distributed Processing
Leveraging Transparent Data Distribution in OpenMP via User-Level Dynamic Page Migration

ISHPC '00 Proceedings of the Third International Symposium on High Performance Computing
Active Memory Clusters: Efficient Multiprocessing on Commodity Clusters

ISHPC '02 Proceedings of the 4th International Symposium on High Performance Computing
Owner prediction for accelerating cache-to-cache transfer misses in a cc-NUMA architecture

Proceedings of the 2002 ACM/IEEE conference on Supercomputing
Efficient synchronization for nonuniform communication architectures

Proceedings of the 2002 ACM/IEEE conference on Supercomputing
A Progressive Approach to Handling Message-Dependent Deadlock in Parallel Computer Systems

IEEE Transactions on Parallel and Distributed Systems
Reducing False Sharing and Improving Spatial Locality in a Unified Compilation Framework

IEEE Transactions on Parallel and Distributed Systems
On the Design of a High-Performance Adaptive Router for CC-NUMA Multiprocessors

IEEE Transactions on Parallel and Distributed Systems
Evaluation of the memory page migration influence in the system performance: the case of the SGI O2000

ICS '03 Proceedings of the 17th annual international conference on Supercomputing
Inferential queueing and speculative push for reducing critical communication latencies

ICS '03 Proceedings of the 17th annual international conference on Supercomputing
Cost-Sensitive Cache Replacement Algorithms

HPCA '03 Proceedings of the 9th International Symposium on High-Performance Computer Architecture
Hierarchical Backoff Locks for Nonuniform Communication Architectures

HPCA '03 Proceedings of the 9th International Symposium on High-Performance Computer Architecture
User-Level Dynamic Page Migration for Multiprogrammed Shared-Memory Multiprocessors

ICPP '00 Proceedings of the 2000 International Conference on Parallel Processing
The Thread-Based Protocol Engines for CC-NUMA Multiprocessors

ICPP '00 Proceedings of the 2000 International Conference on Parallel Processing
The NUMAchine Multiprocessor

ICPP '00 Proceedings of the 2000 International Conference on Parallel Processing
A Balanced Approach to High-Level Verification: Performance Trade-Offs in Verifying Large-Scale Multiprocessors

ICPP '00 Proceedings of the 2000 International Conference on Parallel Processing
Algorithm engineering for parallel computation

Experimental algorithmics
Token coherence: decoupling performance and correctness

Proceedings of the 30th annual international symposium on Computer architecture
Optimizing Parallel Applications for Wide-Area Clusters

IPPS '98 Proceedings of the 12th International Parallel Processing Symposium
Parallel Tree Building on a Range of Shared Address Space Multiprocessors: Algorithms and Application Performance

IPPS '98 Proceedings of the 12th International Parallel Processing Symposium
C++ Expression Templates Performance Issues in Scientific Computing

IPPS '98 Proceedings of the 12th International Parallel Processing Symposium
Ocean warning: avoid drowning

ACM SIGARCH Computer Architecture News
An analytical model of wormhole-routed hypercubes under broadcast traffic

Performance Evaluation
How to vectorize the algebraic multi-level iteration

Computational science, mathematics and software
Latency, Occupancy, and Bandwidth in DSM Multiprocessors: A Performance Evaluation

IEEE Transactions on Computers
Using System Emulation to Model Next-Generation Shared Virtual Memory Clusters

Cluster Computing
A Fault-Tolerant and Deadlock-Free Routing Protocol in 2D Meshes Based on Odd-Even Turn Model

IEEE Transactions on Computers
Quantifying contention and balancing memory load on hardware DSM multiprocessors

Journal of Parallel and Distributed Computing - Special section best papers from the 2002 international parallel and distributed processing symposium
The Impact of Negative Acknowledgments in Shared Memory Scientific Applications

IEEE Transactions on Parallel and Distributed Systems
Towards general and exact distributed invalidation

Journal of Parallel and Distributed Computing
Shared virtual memory clusters: bridging the cost-performance gap between SMPs and hardware DSM systems

Journal of Parallel and Distributed Computing
Efficient Collective Communications in Dual-Cube

The Journal of Supercomputing
High-level abstractions for message-passing parallel programming

Parallel Computing - Special issue: Parallel and distributed scientific and engineering computing
Pipelined functional tree accesses and updates: scheduling, synchronization, caching and coherence

Journal of Functional Programming
Architectural Support for Uniprocessor and Multiprocessor Active Memory Systems

IEEE Transactions on Computers
An active data-aware cache consistency protocol for highly-scalable data-shipping DBMS architectures

Proceedings of the 1st conference on Computing frontiers
SMTp: An Architecture for Next-generation Scalable Multi-threading

Proceedings of the 31st annual international symposium on Computer architecture
Exploring Virtual Network Selection Algorithms in DSM Cache Coherence Protocols

IEEE Transactions on Parallel and Distributed Systems
An Architecture for High-Performance Scalable Shared-Memory Multiprocessors Exploiting On-Chip Integration

IEEE Transactions on Parallel and Distributed Systems
Improving Data Locality by Array Contraction

IEEE Transactions on Computers
A comparative evaluation of hardware-only and software-only directory protocols in shared-memory multiprocessors

Journal of Systems Architecture: the EUROMICRO Journal
On the performance of multicomputer interconnection networks

Journal of Systems Architecture: the EUROMICRO Journal
A Two-Level Directory Architecture for Highly Scalable cc-NUMA Multiprocessors

IEEE Transactions on Parallel and Distributed Systems
Using Hardware Counters to Automatically Improve Memory Performance

Proceedings of the 2004 ACM/IEEE conference on Supercomputing
The Effect of Virtual Channel Organization on the Performance of Interconnection Networks

IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Workshop 14 - Volume 15
Analytical Modelling of Hot-Spot Traffic in Deterministically-Routed K-Ary N-Cubes

IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Workshop 15 - Volume 16
Cache coherence support for non-shared bus architecture on heterogeneous MPSoCs

Proceedings of the 42nd annual Design Automation Conference
Evaluating scheduling policies for fine-grain communication protocols on a cluster of SMPs

Journal of Parallel and Distributed Computing
Microarchitecture of a High-Radix Router

Proceedings of the 32nd annual international symposium on Computer Architecture
Performance analysis of a QoS capable cluster interconnect

Performance Evaluation - Performance modelling and evaluation of high-performance parallel and distributed systems
The STAMPede approach to thread-level speculation

ACM Transactions on Computer Systems (TOCS)
Shared memory computing on clusters with symmetric multiprocessors and system area networks

ACM Transactions on Computer Systems (TOCS)
Automatic thread distribution for nested parallelism in OpenMP

Proceedings of the 19th annual international conference on Supercomputing
The architecture of the HP Superdome shared-memory multiprocessor

Proceedings of the 19th annual international conference on Supercomputing
Fault-tolerant routing and multicasting in hypercubes using a partial path set-up

Parallel Computing
Formal Verification and its Impact on the Snooping versus Directory Protocol Debate

ICCD '05 Proceedings of the 2005 International Conference on Computer Design
Multifacet's general execution-driven multiprocessor simulator (GEMS) toolset

ACM SIGARCH Computer Architecture News - Special issue: dasCMP'05
Fast synchronization on shared-memory multiprocessors: An architectural approach

Journal of Parallel and Distributed Computing - Special issue: Design and performance of networks for super-, cluster-, and grid-computing: Part I
Page migration with dynamic space-sharing scheduling policies: the case of the SGI O2000

International Journal of Parallel Programming - Special issue II: The 17th annual international conference on supercomputing (ICS'03)
Inferential queueing and speculative push

International Journal of Parallel Programming - Special issue I: The 17th annual international conference on supercomputing (ICS'03)
An experimental evaluation of the HP V-class and SGI Origin 2000 multiprocessors using microbenchmarks and scientific applications

International Journal of Parallel Programming
Efficiently generating test vectors with state pruning

Proceedings of the 2005 Asia and South Pacific Design Automation Conference
The BlackWidow High-Radix Clos Network

Proceedings of the 33rd annual international symposium on Computer Architecture
Interconnect-Aware Coherence Protocols for Chip Multiprocessors

Proceedings of the 33rd annual international symposium on Computer Architecture
Concentration, load balancing, partial permutation routing, and superconcentration on cube-connected cycles parallel computers

Journal of Parallel and Distributed Computing
Fault-tolerant multicasting in hypercubes using local safety information

Journal of Parallel and Distributed Computing
A performance model of compressionless routing in k-ary n-cube networks

Performance Evaluation
Fault-tolerant wormhole routing with 2 virtual channels in meshes

Journal of Computer Science and Technology
Fault-tolerant routing in hypercubes using partial path set-up

Future Generation Computer Systems - Systems performance analysis and evaluation
TMA: a trap-based memory architecture

Proceedings of the 20th annual international conference on Supercomputing
Coherence Ordering for Ring-based Chip Multiprocessors

Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
In-Network Cache Coherence

Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Leveraging Optical Technology in Future Bus-based Chip Multiprocessors

Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
The psi-cube: a bus-based cube-type clustering network for high-performance on-chip systems

Parallel Computing
A transparent runtime data distribution engine for OpenMP

Scientific Programming
Application of OpenMP to weather, wave and ocean codes

Scientific Programming
Scaling non-regular shared-memory codes by reusing custom loop schedules

Scientific Programming - OpenMP
Self-tuning reactive diffracting trees

Journal of Parallel and Distributed Computing
Efficient self-tuning spin-locks using competitive analysis

Journal of Systems and Software
Proximity-aware directory-based coherence for multi-core processor architectures

Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures
Synchronization state buffer: supporting efficient fine-grain synchronization on many-core architectures

Proceedings of the 34th annual international symposium on Computer architecture
Virtual hierarchies to support server consolidation

Proceedings of the 34th annual international symposium on Computer architecture
Flattened butterfly: a cost-efficient topology for high-radix networks

Proceedings of the 34th annual international symposium on Computer architecture
Performance-driven processor allocation

OSDI'00 Proceedings of the 4th conference on Symposium on Operating System Design & Implementation - Volume 4
A queueing model for predicting message latency in uni-directional k-ary n-cubes with deterministic routing and non-uniform traffic

Cluster Computing
Global memory management for a multi computer system

WSS'00 Proceedings of the 4th conference on USENIX Windows Systems Symposium - Volume 4
Windows NT in a ccNUMA system

WINSYM'99 Proceedings of the 3rd conference on USENIX Windows NT Symposium - Volume 3
Executing irregular scientific applications on stream architectures

Proceedings of the 21st annual international conference on Supercomputing
Active memory operations

Proceedings of the 21st annual international conference on Supercomputing
Synchronization coherence: A transparent hardware mechanism for cache coherence and fine-grained synchronization

Journal of Parallel and Distributed Computing
The case for simple, visible cache coherency

Proceedings of the 2008 ACM SIGPLAN workshop on Memory systems performance and correctness: held in conjunction with the Thirteenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '08)
Bounding the minimal completion time in high-performance parallel processing

International Journal of High Performance Computing and Networking
Scalable barrier synchronisation for large-scale shared-memory multiprocessors

International Journal of High Performance Computing and Networking
A case for low-complexity MP architectures

Proceedings of the 2007 ACM/IEEE conference on Supercomputing
Combinable memory-block transactions

Proceedings of the twentieth annual symposium on Parallelism in algorithms and architectures
Technology-Driven, Highly-Scalable Dragonfly Topology

ISCA '08 Proceedings of the 35th Annual International Symposium on Computer Architecture
Flexible Decoupled Transactional Memory Support

ISCA '08 Proceedings of the 35th Annual International Symposium on Computer Architecture
Virtual Circuit Tree Multicasting: A Case for On-Chip Hardware Multicast Support

ISCA '08 Proceedings of the 35th Annual International Symposium on Computer Architecture
Atomic Vector Operations on Chip Multiprocessors

ISCA '08 Proceedings of the 35th Annual International Symposium on Computer Architecture
Reducing the Interconnection Network Cost of Chip Multiprocessors

NOCS '08 Proceedings of the Second ACM/IEEE International Symposium on Networks-on-Chip
Circuit-Switched Coherence

NOCS '08 Proceedings of the Second ACM/IEEE International Symposium on Networks-on-Chip
A General Framework for Designing Approximation Schemes for Combinatorial Optimization Problems with Many Objectives Combined into One

APPROX '08 / RANDOM '08 Proceedings of the 11th international workshop, APPROX 2008, and 12th international workshop, RANDOM 2008 on Approximation, Randomization and Combinatorial Optimization: Algorithms and Techniques
Two proposals for the inclusion of directory information in the last-level private caches of glueless shared-memory multiprocessors

Journal of Parallel and Distributed Computing
Unicast-based fault-tolerant multicasting in wormhole-routed hypercubes

Journal of Systems Architecture: the EUROMICRO Journal
Accelerating critical section execution with asymmetric multi-core architectures

Proceedings of the 14th international conference on Architectural support for programming languages and operating systems
Token tenure: PATCHing token counting using directory-based cache coherence

Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture
Efficient unicast and multicast support for CMPs

Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture
Virtual tree coherence: Leveraging regions and in-network multicast trees for scalable cache coherence

Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture
Dynamic data migration for structured AMR solvers

International Journal of Parallel Programming
Disaggregated memory for expansion and sharing in blade servers

Proceedings of the 36th annual international symposium on Computer architecture
A Methodology to Characterize Critical Section Bottlenecks in DSM Multiprocessors

Euro-Par '09 Proceedings of the 15th International Euro-Par Conference on Parallel Processing
Technical Communication: SimuRed: A flit-level event-driven simulator for multicomputer network performance evaluation

Computers and Electrical Engineering
A tuneable software cache coherence protocol for heterogeneous MPSoCs

CODES+ISSS '09 Proceedings of the 7th IEEE/ACM international conference on Hardware/software codesign and system synthesis
Experience with building a commodity Intel-based ccNUMA system

IBM Journal of Research and Development
High-throughput coherence control and hardware messaging in Everest

IBM Journal of Research and Development
Application of self organizing maps for investigating network latency on a broadcast-based distributed shared memory multiprocessor

Expert Systems with Applications: An International Journal
SCARAB: a single cycle adaptive routing and bufferless network

Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
Pseudo-LIFO: the foundation of a new family of replacement policies for last-level caches

Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
A tagless coherence directory

Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
Performance evaluation of directory protocols on an optical broadcast-based distributed shared memory multiprocessor

Computers and Electrical Engineering
Fault-tolerant routing and multicasting in hypercubes using a partial path set-up

Parallel Computing
A two-level directory organization solution for CC-NUMA systems

ICA3PP'07 Proceedings of the 7th international conference on Algorithms and architectures for parallel processing
The SKB: a semi-completely-connected bus for on-chip systems

NPC'07 Proceedings of the 2007 IFIP international conference on Network and parallel computing
Predicting the performance measures of an optical distributed shared memory multiprocessor by using support vector regression

Expert Systems with Applications: An International Journal
Cohesion: a hybrid memory model for accelerators

Proceedings of the 37th annual international symposium on Computer architecture
Data marshaling for multi-core architectures

Proceedings of the 37th annual international symposium on Computer architecture
The connection-then-credit flow control protocol for heterogeneous multicore systems-on-chip

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems - Special issue on the 2009 ACM/IEEE international symposium on networks-on-chip
Token tenure and PATCH: A predictive/adaptive token-counting hybrid

ACM Transactions on Architecture and Code Optimization (TACO)
Implementation tradeoffs in the design of flexible transactional memory support

Journal of Parallel and Distributed Computing
WAYPOINT: scaling coherence to thousand-core architectures

Proceedings of the 19th international conference on Parallel architectures and compilation techniques
SPACE: sharing pattern-based directory coherence for multicore scalability

Proceedings of the 19th international conference on Parallel architectures and compilation techniques
SWEL: hardware cache coherence protocols to map shared data onto shared caches

Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Parallel algorithms

Algorithms and theory of computation handbook
Simple but Effective Heterogeneous Main Memory with On-Chip Memory Controller Support

Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
Geographical locality and dynamic data migration for OpenMP implementations of adaptive PDE solvers

IWOMP'05/IWOMP'06 Proceedings of the 2005 and 2006 international conference on OpenMP shared memory parallel programming
Reducing the latency of L2 misses in shared-memory multiprocessors through on-chip directory integration

EUROMICRO-PDP'02 Proceedings of the 10th Euromicro conference on Parallel, distributed and network-based processing
HPP controller: a system controller for high performance computing

Frontiers of Computer Science in China
Towards architecture independent metrics for multicore performance analysis

ACM SIGMETRICS Performance Evaluation Review
Process scheduling for future multicore processors

Proceedings of the Fifth International Workshop on Interconnection Network Architecture: On-Chip, Multi-Chip
Architectural Support for Fair Reader-Writer Locking

MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
NoC-aware cache design for multithreaded execution on tiled chip multiprocessors

Proceedings of the 6th International Conference on High Performance and Embedded Architectures and Compilers
Efficient unicast in bijective connection networks with the restricted faulty node set

Information Sciences: an International Journal
Research note: C-AMTE: A location mechanism for flexible cache management in chip multiprocessors

Journal of Parallel and Distributed Computing
A case for globally shared-medium on-chip interconnect

Proceedings of the 38th annual international symposium on Computer architecture
A case for NUMA-aware contention management on multicore systems

USENIXATC'11 Proceedings of the 2011 USENIX conference on USENIX annual technical conference
BarrierWatch: characterizing multithreaded workloads across and within program-defined epochs

Proceedings of the 8th ACM International Conference on Computing Frontiers
Filtering directory lookups in CMPs with write-through caches

Euro-Par'11 Proceedings of the 17th international conference on Parallel processing - Volume Part I
A minimal average accessing time scheduler for multicore processors

ICA3PP'11 Proceedings of the 11th international conference on Algorithms and architectures for parallel processing - Volume Part II
A new hybrid directory scheme for shared memory multi-processors

CSR'06 Proceedings of the First international computer science conference on Theory and Applications
Write invalidation analysis in chip multiprocessors

PATMOS'09 Proceedings of the 19th international conference on Integrated Circuit and System Design: power and Timing Modeling, Optimization and Simulation
Speeding-up synchronizations in DSM multiprocessors

Euro-Par'06 Proceedings of the 12th international conference on Parallel Processing
Experiences with Co-Array Fortran on hardware shared memory platforms

LCPC'04 Proceedings of the 17th international conference on Languages and Compilers for High Performance Computing
The data diffusion space for parallel computing in clusters

Euro-Par'05 Proceedings of the 11th international Euro-Par conference on Parallel Processing
A novel lightweight directory architecture for scalable shared-memory multiprocessors

Euro-Par'05 Proceedings of the 11th international Euro-Par conference on Parallel Processing
A hybrid strategy based on data distribution and migration for optimizing memory locality

LCPC'02 Proceedings of the 15th international conference on Languages and Compilers for Parallel Computing
SIMT/OMP: a toolset to study and exploit memory locality of OpenMP applications on NUMA architectures

WOMPAT'04 Proceedings of the 5th international conference on OpenMP Applications and Tools: shared Memory Parallel Programming with OpenMP
Towards the ideal on-chip fabric for 1-to-many and many-to-1 communication

Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
An optimized multicore cache coherence design for exploiting communication locality

Proceedings of the great lakes symposium on VLSI
Why on-chip cache coherence is here to stay

Communications of the ACM
Exploration of heuristic scheduling algorithms for 3D multicore processors

Proceedings of the 15th International Workshop on Software and Compilers for Embedded Systems
A greedy heuristic approximation scheduling algorithm for 3D multicore processors

Euro-Par'11 Proceedings of the 2011 international conference on Parallel Processing
Node-based memory management for scalable NUMA architectures

Proceedings of the 2nd International Workshop on Runtime and Operating Systems for Supercomputers
On-chip networks from a networking perspective: congestion and scalability in many-core interconnects

Proceedings of the ACM SIGCOMM 2012 conference on Applications, technologies, architectures, and protocols for computer communication
On-chip networks from a networking perspective: congestion and scalability in many-core interconnects

ACM SIGCOMM Computer Communication Review - Special October issue SIGCOMM '12
Active memory controller

The Journal of Supercomputing
SGI® UV2: a fused computation and data analysis machine

SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Concerning with on-chip network features to improve cache coherence protocols for CMPs

ACSAC'07 Proceedings of the 12th Asia-Pacific conference on Advances in Computer Systems Architecture
Edge chasing delayed consistency: pushing the limits of weak memory models

Proceedings of the 2012 ACM workshop on Relaxing synchronization for multicore and manycore scalability
Moths: Mobile threads for on-chip networks

ACM Transactions on Embedded Computing Systems (TECS) - Special section on ESTIMedia'12, LCTES'11, rigorous embedded systems design, and multiprocessor system-on-chip for cyber-physical systems
Predicting Coherence Communication by Tracking Synchronization Points at Run Time

MICRO-45 Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture
TRANSIT: specifying protocols with concolic snippets

Proceedings of the 34th ACM SIGPLAN conference on Programming language design and implementation
Building expressive, area-efficient coherence directories

PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques
Multi-grain coherence directories

Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture
Using in-flight chains to build a scalable cache coherence protocol

ACM Transactions on Architecture and Code Optimization (TACO)
High-performance fractal coherence

Proceedings of the 19th international conference on Architectural support for programming languages and operating systems

Abstract

The SGI Origin 2000 is a cache-coherent non-uniform memory access (ccNUMA) multiprocessor designed and manufactured by Silicon Graphics, Inc. The Origin system was designed from the ground up as a multiprocessor capable of scaling to both small and large processor counts without any bandwidth, latency, or cost cliffs. The Origin system consists of up to 512 nodes interconnected by a scalable Craylink network. Each node consists of one or two R10000 processors, up to 4 GB of coherent memory, and a connection to a portion of the XIO IO subsystem. This paper discusses the motivation for building the Origin 2000 and then describes its architecture and implementation. In addition, performance results are presented for the NAS Parallel Benchmarks V2.2 and the SPLASH2 applications. Finally, the Origin system is compared to other contemporary commercial ccNUMA systems.