WildFire: A Scalable Path for SMPs

Authors:
Erik Hagersten;Michael Koster
Affiliations:
-;-
Venue:
HPCA '99 Proceedings of the 5th International Symposium on High Performance Computer Architecture
Year:
1999

Citing 0
Cited 50

Memory sharing predictor: the key to a speculative coherent DSM

ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
Scal-Tool: pinpointing and quantifying scalability bottlenecks in DSM multiprocessors

SC '99 Proceedings of the 1999 ACM/IEEE conference on Supercomputing
Performance experiences on Sun's Wildfire prototype

SC '99 Proceedings of the 1999 ACM/IEEE conference on Supercomputing
A case for user-level dynamic page migration

Proceedings of the 14th international conference on Supercomputing
Selective, accurate, and timely self-invalidation using last-touch prediction

Proceedings of the 27th annual international symposium on Computer architecture
Comparing the effectiveness of fine-grain memory caching against page migration/replication in reducing traffic in DSM clusters

Proceedings of the twelfth annual ACM symposium on Parallel algorithms and architectures
Architecture and design of AlphaServer GS320

ACM SIGPLAN Notices
Architecture and design of AlphaServer GS320

ASPLOS IX Proceedings of the ninth international conference on Architectural support for programming languages and operating systems
Scalable queue-based spin locks with timeout

PPoPP '01 Proceedings of the eighth ACM SIGPLAN symposium on Principles and practices of parallel programming
Removing the overhead from software-based shared memory

Proceedings of the 2001 ACM/IEEE conference on Supercomputing
Runtime vs. Manual Data Distribution for Architecture-Agnostic Shared-Memory Programming Models

International Journal of Parallel Programming
Cache-Only Memory Architectures

Computer
Using Loop-Level Parallelism to Parallelize Vectorizable Programs

HIPS '01 Proceedings of the 6th International Workshop on High-Level Parallel Programming Models and Supportive Environments
OpenMP versus MPI for PDE Solvers Based on Regular Sparse Numerical Operators

ICCS '02 Proceedings of the International Conference on Computational Science-Part III
Quantifying and Resolving Remote Memory Access Contention on Hardware DSM Multiprocessors

IPDPS '02 Proceedings of the 16th International Parallel and Distributed Processing Symposium
A Tool for Binding Threads to Processors

Euro-Par '01 Proceedings of the 7th International Euro-Par Conference Manchester on Parallel Processing
Performance of High-Accuracy PDE Solvers on a Self-Optimizing NUMA Architecture

Euro-Par '01 Proceedings of the 7th International Euro-Par Conference Manchester on Parallel Processing
Optimizations on Array Skeletons in a Shared Memory Environment

IFL '02 Selected Papers from the 13th International Workshop on Implementation of Functional Languages
Efficient synchronization for nonuniform communication architectures

Proceedings of the 2002 ACM/IEEE conference on Supercomputing
Hierarchical Backoff Locks for Nonuniform Communication Architectures

HPCA '03 Proceedings of the 9th International Symposium on High-Performance Computer Architecture
Memory System Behavior of Java-Based Middleware

HPCA '03 Proceedings of the 9th International Symposium on High-Performance Computer Architecture
User-Level Dynamic Page Migration for Multiprogrammed Shared-Memory Multiprocessors

ICPP '00 Proceedings of the Proceedings of the 2000 International Conference on Parallel Processing
Using destination-set prediction to improve the latency/bandwidth tradeoff in shared-memory multiprocessors

Proceedings of the 30th annual international symposium on Computer architecture
Quantifying contention and balancing memory load on hardware DSM multiprocessors

Journal of Parallel and Distributed Computing - Special section best papers from the 2002 international parallel and distributed processing symposium
The Impact of Negative Acknowledgments in Shared Memory Scientific Applications

IEEE Transactions on Parallel and Distributed Systems
Tolerating Late Memory Traps in Dynamically Scheduled Processors

IEEE Transactions on Computers
Using Hardware Counters to Automatically Improve Memory Performance

Proceedings of the 2004 ACM/IEEE conference on Supercomputing
A Cost-Effective Main Memory Organization for Future Servers

IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Papers - Volume 01
Reducing coherence overhead and boosting performance of high-end SMP multiprocessors running a DSS workload

Journal of Parallel and Distributed Computing
OpenMP versus MPI for PDE solvers based on regular sparse numerical operators

Future Generation Computer Systems
TMA: a trap-based memory architecture

Proceedings of the 20th annual international conference on Supercomputing
A transparent runtime data distribution engine for OpenMP

Scientific Programming
Scaling non-regular shared-memory codes by reusing custom loop schedules

Scientific Programming - OpenMP
Virtual hierarchies to support server consolidation

Proceedings of the 34th annual international symposium on Computer architecture
The case for simple, visible cache coherency

Proceedings of the 2008 ACM SIGPLAN workshop on Memory systems performance and correctness: held in conjunction with the Thirteenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '08)
Bounding the minimal completion time in high-performance parallel processing

International Journal of High Performance Computing and Networking
A case for low-complexity MP architectures

Proceedings of the 2007 ACM/IEEE conference on Supercomputing
Hardware monitors for dynamic page migration

Journal of Parallel and Distributed Computing
Token tenure: PATCHing token counting using directory-based cache coherence

Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture
Disaggregated memory for expansion and sharing in blade servers

Proceedings of the 36th annual international symposium on Computer architecture
Application of self organizing maps for investigating network latency on a broadcast-based distributed shared memory multiprocessor

Expert Systems with Applications: An International Journal
OpenMP versus MPI for PDE solvers based on regular sparse numerical operators

Future Generation Computer Systems
Predicting the performance measures of an optical distributed shared memory multiprocessor by using support vector regression

Expert Systems with Applications: An International Journal
Cohesion: a hybrid memory model for accelerators

Proceedings of the 37th annual international symposium on Computer architecture
Token tenure and PATCH: A predictive/adaptive token-counting hybrid

ACM Transactions on Architecture and Code Optimization (TACO)
WAYPOINT: scaling coherence to thousand-core architectures

Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Exploiting locality: a flexible DSM approach

IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
Manager-client pairing: a framework for implementing coherence hierarchies

Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
UniFI: leveraging non-volatile memories for a unified fault tolerance and idle power management technique

Proceedings of the 26th ACM international conference on Supercomputing
Using in-flight chains to build a scalable cache coherence protocol

ACM Transactions on Architecture and Code Optimization (TACO)

Quantified Score

Hi-index	0.01

Visualization

Abstract

Researchers have searched for scalable alternatives to the symmetric multiprocessor (SMP) architecture since it was first introduced in 1982. This paper introduces an alternative view of the relationship between scalable technologies and SMPs. Instead of replacing large SMPs with scalable technology, we propose new scalable techniques that allow large SMPs to be tied together efficiently, while maintaining the compatibility with, and performance characteristics of, an SMP. The trade-offs of such an architecture differ from those of traditional, scalable, Non-Uniform Memory Architecture (cc-NUMA) approaches.WildFire is a distributed shared-memory (DSM) prototype implementation based on large SMPs. It relies on two techniques for creating application-transparent locality: Coherent Memory Replication (CMR), which is a variation of Simple COMA/Reactive NUMA, and Hierarchical Affinity Scheduling (HAS). These two optimizations create extra node locality, which blurs the node boundaries to an application such that SMP-like performance can be achieved with no NUMA-specific optimizations.We present a performance study of a large OLTP benchmark running on DSMs built from various-sized nodes and with varying amounts of application-transparent locality. WildFire's measured performance is shown to be more than two times that of an unoptimized NUMA implementation built from small nodes and within 13% of the performance of the ideal implementation: a large SMP with the same access time to its entire shared memory as the local memory access time of WildFire.