Supporting systolic and memory communication in iWarp

Authors:
Shekhar Borkar;Robert Cohn;George Cox;Thomas Gross;H. T. Kung;Monica Lam;Margie Levine;Brian Moore;Wire Moore;Craig Peterson;Jim Susman;Jim Sutton;John Urbanski;Jon Webb
Affiliations:
School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania and Intel Corporation, CO4-01, 5200 N.E. Elam Young Pkwy, Hillsboro, Oregon
Venue:
ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
Year:
1990

Abstract

iWarp is a parallel architecture developed jointly by Carnegie Mellon University and Intel Corporation. The iWarp communication system supports two widely used interprocessor communication styles: memory communication and systolic communication. This paper describes the rationale, architecture, and implementation of the iWarp communication system.

The sending or receiving processor of a message can perform either memory or systolic communication. In memory communication, the entire message is buffered in the local memory of the processor before it is transmitted or after it is received, so communication begins or terminates at the local memory. In conventional message-passing methods, both the sending and the receiving processor use memory communication. In systolic communication, individual data items are transferred as they are produced, or used as they are received, by the program running on the processor. Memory communication is flexible and well suited to general computing, whereas systolic communication is efficient and well suited to speed-critical applications.

A major achievement of the iWarp effort is the derivation of a common design that satisfies the requirements of both the systolic and the memory communication styles. This is made possible by two important innovations in communication: (1) program access to communication and (2) logical channels. The former allows programs to access data as they are transmitted and to redirect portions of messages to different destinations efficiently. The latter increases the connectivity between the processors and guarantees communication bandwidth for classes of messages. These innovations have provided a focus for the iWarp architecture. The result is a communication system that provides a total bandwidth of 320 MBytes/sec and that is integrated on a single VLSI component with a 20 MFLOPS plus 20 MIPS long instruction word computation engine.
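The difference between the two communication styles on the receiving side can be illustrated with a small sketch. It is a toy model in Python, not iWarp code: the names `produce`, `receive_memory_style`, and `receive_systolic_style` are hypothetical, a generator stands in for the link between two processors, and the sum of squares is an arbitrary stand-in for the receiver's computation.

```python
def produce(n):
    """Hypothetical sender: yields items one at a time, as a program produces them."""
    for i in range(1, n + 1):
        yield i


def receive_memory_style(link):
    """Memory communication: the whole message is buffered before it is used."""
    buffer = list(link)              # entire message lands in local memory first
    peak_buffered = len(buffer)      # buffer space grows with message length
    result = sum(x * x for x in buffer)
    return result, peak_buffered


def receive_systolic_style(link):
    """Systolic communication: each item is used as soon as it is received."""
    result = 0
    peak_buffered = 0
    for x in link:                   # one item at a time, no message buffer
        peak_buffered = max(peak_buffered, 1)
        result += x * x
    return result, peak_buffered


if __name__ == "__main__":
    print(receive_memory_style(produce(4)))
    print(receive_systolic_style(produce(4)))
```

Both receivers compute the same result. In this model the memory-style receiver holds the full message at once, while the systolic-style receiver never holds more than one item, which is the trade-off between flexibility and efficiency that the abstract describes.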