The MIT Alewife machine: architecture and performance

  • Authors:
  • Anant Agarwal; Ricardo Bianchini; David Chaiken; Kirk L. Johnson; David Kranz; John Kubiatowicz; Beng-Hong Lim; Kenneth Mackenzie; Donald Yeung

  • Affiliations:
  • Laboratory for Computer Science, Massachusetts Institute of Technology, Cambridge, Massachusetts; University of Rochester, Rochester, NY and Laboratory for Computer Science, Massachusetts Institute of Technology, Cambridge, Massachusetts; Digital Equipment Corporation Systems Research Center, Palo Alto, CA and Laboratory for Computer Science, Massachusetts Institute of Technology, Cambridge, Massachusetts; Laboratory for Computer Science, Massachusetts Institute of Technology, Cambridge, Massachusetts; Laboratory for Computer Science, Massachusetts Institute of Technology, Cambridge, Massachusetts; Laboratory for Computer Science, Massachusetts Institute of Technology, Cambridge, Massachusetts; IBM T.J. Watson Research Center, Yorktown Heights, NY and Laboratory for Computer Science, Massachusetts Institute of Technology, Cambridge, Massachusetts; Laboratory for Computer Science, Massachusetts Institute of Technology, Cambridge, Massachusetts; Laboratory for Computer Science, Massachusetts Institute of Technology, Cambridge, Massachusetts

  • Venue:
  • ISCA '95: Proceedings of the 22nd Annual International Symposium on Computer Architecture
  • Year:
  • 1995

Abstract

Alewife is a multiprocessor architecture that supports up to 512 processing nodes connected over a scalable and cost-effective mesh network at a constant cost per node. The MIT Alewife machine, a prototype implementation of the architecture, demonstrates that a parallel system can be both scalable and programmable. Four mechanisms combine to achieve these goals: software-extended coherent shared memory provides a global, linear address space; integrated message passing allows compiler and operating system designers to provide efficient communication and synchronization; support for fine-grain computation allows many processors to cooperate on small problem sizes; and latency tolerance mechanisms, including block multithreading and prefetching, mask unavoidable delays due to communication.

Microbenchmarks, together with over a dozen complete applications running on the 32-node prototype, help to analyze the behavior of the system. Analysis shows that integrating message passing with shared memory enables a cost-efficient solution to the cache coherence problem and provides a rich set of programming primitives. Block multithreading and prefetching improve performance by up to 25% individually, and 35% together. Finally, language constructs that allow programmers to express fine-grain synchronization can improve performance by over a factor of two.
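
To make the abstract's point about fine-grain synchronization concrete, the sketch below shows per-element producer/consumer synchronization written in portable C with POSIX threads. It is only an illustration of the idea, not Alewife's actual interface: the machine provides this pattern through hardware full/empty bits and language-level constructs, whereas the jcell_t/jvec_* names here are hypothetical and the blocking is emulated in software.

    /*
     * Sketch: fine-grain (per-element) producer/consumer synchronization,
     * emulated in portable C with POSIX threads.  The jcell_t/jvec_* names
     * are illustrative only; on Alewife this pattern is supported directly
     * by hardware full/empty bits and language constructs.
     */
    #include <pthread.h>
    #include <stdio.h>

    #define N 16

    typedef struct {
        double value;
        int full;                 /* software stand-in for a full/empty bit */
        pthread_mutex_t lock;
        pthread_cond_t filled;
    } jcell_t;

    static jcell_t vec[N];

    static void jvec_write(jcell_t *c, double v)
    {
        pthread_mutex_lock(&c->lock);
        c->value = v;
        c->full = 1;              /* mark this element "full" */
        pthread_cond_broadcast(&c->filled);
        pthread_mutex_unlock(&c->lock);
    }

    static double jvec_read(jcell_t *c)
    {
        pthread_mutex_lock(&c->lock);
        while (!c->full)          /* consumer blocks on this element only */
            pthread_cond_wait(&c->filled, &c->lock);
        double v = c->value;
        pthread_mutex_unlock(&c->lock);
        return v;
    }

    static void *producer(void *arg)
    {
        (void)arg;
        for (int i = 0; i < N; i++)
            jvec_write(&vec[i], i * 1.5);   /* element usable as soon as written */
        return NULL;
    }

    int main(void)
    {
        for (int i = 0; i < N; i++) {
            vec[i].full = 0;
            pthread_mutex_init(&vec[i].lock, NULL);
            pthread_cond_init(&vec[i].filled, NULL);
        }

        pthread_t p;
        pthread_create(&p, NULL, producer, NULL);

        /* Consumer starts immediately; no barrier over the whole vector. */
        double sum = 0.0;
        for (int i = 0; i < N; i++)
            sum += jvec_read(&vec[i]);

        pthread_join(p, NULL);
        printf("sum = %g\n", sum);
        return 0;
    }

Because each consumer waits only for the single element it needs rather than for a barrier over the whole vector, producer and consumer overlap at word granularity; making that overlap cheap is the effect the fine-grain mechanisms described in the abstract are meant to provide.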