The MIT Alewife machine: architecture and performance

  • Authors:
  • Anant Agarwal; Ricardo Bianchini; David Chaiken; Kirk L. Johnson; David Kranz; John Kubiatowicz; Beng-Hong Lim; Kenneth Mackenzie; Donald Yeung

  • Affiliations:
  • Laboratory for Computer Science, Massachusetts Institute of Technology, Cambridge, Massachusetts; University of Rochester, Rochester, NY and Laboratory for Computer Science, Massachusetts Institute of Technology, Cambridge, Massachusetts; Digital Equipment Corporation Systems Research Center, Palo Alto, CA and Laboratory for Computer Science, Massachusetts Institute of Technology, Cambridge, Massachusetts; Laboratory for Computer Science, Massachusetts Institute of Technology, Cambridge, Massachusetts; Laboratory for Computer Science, Massachusetts Institute of Technology, Cambridge, Massachusetts; Laboratory for Computer Science, Massachusetts Institute of Technology, Cambridge, Massachusetts; IBM T.J. Watson Research Center, Yorktown Heights, NY and Laboratory for Computer Science, Massachusetts Institute of Technology, Cambridge, Massachusetts; Laboratory for Computer Science, Massachusetts Institute of Technology, Cambridge, Massachusetts; Laboratory for Computer Science, Massachusetts Institute of Technology, Cambridge, Massachusetts

  • Venue:
  • ISCA '95: Proceedings of the 22nd Annual International Symposium on Computer Architecture
  • Year:
  • 1995

Abstract

Alewife is a multiprocessor architecture that supports up to 512 processing nodes connected over a scalable and cost-effective mesh network at a constant cost per node. The MIT Alewife machine, a prototype implementation of the architecture, demonstrates that a parallel system can be both scalable and programmable. Four mechanisms combine to achieve these goals: software-extended coherent shared memory provides a global, linear address space; integrated message passing allows compiler and operating system designers to provide efficient communication and synchronization; support for fine-grain computation allows many processors to cooperate on small problem sizes; and latency tolerance mechanisms, including block multithreading and prefetching, mask unavoidable delays due to communication.

Microbenchmarks, together with over a dozen complete applications running on the 32-node prototype, help to analyze the behavior of the system. Analysis shows that integrating message passing with shared memory enables a cost-efficient solution to the cache coherence problem and provides a rich set of programming primitives. Block multithreading and prefetching improve performance by up to 25% individually, and 35% together. Finally, language constructs that allow programmers to express fine-grain synchronization can improve performance by over a factor of two.
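
To make the abstract's point about fine-grain synchronization concrete, the sketch below shows per-element producer/consumer synchronization written in portable C with POSIX threads. It is only an illustration of the idea, not Alewife's actual interface: the machine provides this pattern through hardware full/empty bits and language-level constructs, whereas the jcell_t/jvec_* names here are hypothetical and the blocking is emulated in software.

    /*
     * Sketch: fine-grain (per-element) producer/consumer synchronization,
     * emulated in portable C with POSIX threads.  The jcell_t/jvec_* names
     * are illustrative only; on Alewife this pattern is supported directly
     * by hardware full/empty bits and language constructs.
     */
    #include <pthread.h>
    #include <stdio.h>

    #define N 16

    typedef struct {
        double value;
        int full;                 /* software stand-in for a full/empty bit */
        pthread_mutex_t lock;
        pthread_cond_t filled;
    } jcell_t;

    static jcell_t vec[N];

    static void jvec_write(jcell_t *c, double v)
    {
        pthread_mutex_lock(&c->lock);
        c->value = v;
        c->full = 1;              /* mark this element "full" */
        pthread_cond_broadcast(&c->filled);
        pthread_mutex_unlock(&c->lock);
    }

    static double jvec_read(jcell_t *c)
    {
        pthread_mutex_lock(&c->lock);
        while (!c->full)          /* consumer blocks on this element only */
            pthread_cond_wait(&c->filled, &c->lock);
        double v = c->value;
        pthread_mutex_unlock(&c->lock);
        return v;
    }

    static void *producer(void *arg)
    {
        (void)arg;
        for (int i = 0; i < N; i++)
            jvec_write(&vec[i], i * 1.5);   /* element usable as soon as written */
        return NULL;
    }

    int main(void)
    {
        for (int i = 0; i < N; i++) {
            vec[i].full = 0;
            pthread_mutex_init(&vec[i].lock, NULL);
            pthread_cond_init(&vec[i].filled, NULL);
        }

        pthread_t p;
        pthread_create(&p, NULL, producer, NULL);

        /* Consumer starts immediately; no barrier over the whole vector. */
        double sum = 0.0;
        for (int i = 0; i < N; i++)
            sum += jvec_read(&vec[i]);

        pthread_join(p, NULL);
        printf("sum = %g\n", sum);
        return 0;
    }

Because each consumer waits only for the single element it needs rather than for a barrier over the whole vector, producer and consumer overlap at word granularity; making that overlap cheap is the effect the fine-grain mechanisms described in the abstract are meant to provide.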