The performance impact of flexibility in the Stanford FLASH multiprocessor

Authors:
Mark Heinrich;Jeffrey Kuskin;David Ofelt;John Heinlein;Joel Baxter;Jaswinder Pal Singh;Richard Simoni;Kourosh Gharachorloo;David Nakahira;Mark Horowitz;Anoop Gupta;Mendel Rosenblum;John Hennessy
Affiliations:
Computer Systems Laboratory, Stanford University, Stanford, CA (all authors)
Venue:
ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
Year:
1994


Abstract

A flexible communication mechanism is a desirable feature in multiprocessors because it allows support for multiple communication protocols, expands performance monitoring capabilities, and leads to a simpler design and debug process. In the Stanford FLASH multiprocessor, flexibility is obtained by requiring all transactions in a node to pass through a programmable node controller, called MAGIC. In this paper, we evaluate the performance costs of flexibility by comparing the performance of FLASH to that of an idealized hardwired machine on representative parallel applications and a multiprogramming workload. To measure the performance of FLASH, we use a detailed simulator of the FLASH and MAGIC designs, together with the code sequences that implement the cache-coherence protocol. We find that for a range of optimized parallel applications the performance differences between the idealized machine and FLASH are small. For these programs, either the miss rates are small or the latency of the programmable protocol can be hidden behind the memory access time. For applications that incur a large number of remote misses or exhibit substantial hot-spotting, performance is poor for both machines, though the increased remote access latencies or the occupancy of MAGIC lead to lower performance for the flexible design. In most cases, however, FLASH is only 2%–12% slower than the idealized machine.
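
The claim that protocol latency "can be hidden behind the memory access time" can be illustrated with a toy latency model. The sketch below is not the FLASH simulator and uses no figures from the paper: the function names, the cycle counts, and the assumption that the idealized hardwired machine pays only the memory access time are all hypothetical, chosen only to show how overlap and miss rate interact.

```python
def miss_latency(memory_cycles, handler_cycles, overlapped=True):
    """Cycles to service one miss at the home node (toy model).

    If the protocol handler runs in parallel with the memory access,
    the slower of the two determines the latency; otherwise they add.
    """
    if overlapped:
        return max(memory_cycles, handler_cycles)
    return memory_cycles + handler_cycles


def slowdown(base_cycles, misses, ideal_miss, flexible_miss):
    """Relative slowdown of the flexible design versus the idealized one."""
    ideal_total = base_cycles + misses * ideal_miss
    flexible_total = base_cycles + misses * flexible_miss
    return flexible_total / ideal_total - 1.0


if __name__ == "__main__":
    memory = 20  # hypothetical memory access time, in cycles

    # A short handler is fully hidden: the flexible design costs nothing extra.
    hidden = miss_latency(memory, handler_cycles=12)
    # A long handler exceeds the memory access and becomes visible.
    exposed = miss_latency(memory, handler_cycles=30)

    for misses in (1_000, 10_000):
        s = slowdown(1_000_000, misses, ideal_miss=memory, flexible_miss=exposed)
        print(f"misses={misses}: hidden={hidden}, exposed={exposed}, slowdown={s:.2%}")
```

In this model the flexible design matches the idealized one whenever the handler finishes within the memory access, and the penalty from a longer handler grows with the number of misses. The model ignores controller occupancy and contention, which the abstract identifies as a further cost under hot-spotting.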