iWarp: an integrated solution to high-speed parallel computing

Authors:
S. Borkar;R. Cohn;G. Cox;S. Gleason;T. Gross
Affiliations:
Department of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania;Intel Corporation, JFl-60, 5200 N.E. Elam Young Pkwy, Hillsboro, Oregon;Department of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania;Department of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania;Department of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania
Venue:
Proceedings of the 1988 ACM/IEEE conference on Supercomputing
Year:
1988

Abstract

iWarp is a system architecture for high-speed signal, image, and scientific computing. The heart of an iWarp system is the iWarp component: a single-chip processor that requires only the addition of memory chips to form a complete system building block, called the iWarp cell. Each iWarp component contains both a powerful computation engine (20 MFLOPS) and a high-throughput (320 MBytes/sec), low-latency (100-150 ns) communication engine for interfacing with other iWarp cells. Because of its strong computation and communication capabilities, the iWarp component is a versatile building block for a range of high-performance parallel systems, from special-purpose systolic arrays to general-purpose distributed-memory computers. These systems can support fine-grain parallel and coarse-grain distributed computation models simultaneously in the same system. An iWarp system can include a large number of cells; the initial iWarp demonstration system consists of an 8x8 torus of iWarp cells, delivering more than 1.2 GFLOPS, and can be expanded to up to 1,024 cells. This paper describes the iWarp architecture and how it supports various communication models and system configurations.
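
The aggregate figure quoted in the abstract can be checked with a little arithmetic. The sketch below is illustrative only and is not taken from the paper: it assumes that peak system throughput is simply the number of cells multiplied by the 20 MFLOPS per-cell rate, and that the 8x8 torus uses ordinary wraparound (modulo) neighbor links in both dimensions. The names `peak_gflops` and `torus_neighbors` are hypothetical.

```python
# Illustrative sketch; function names and the "cells x per-cell peak" model
# are assumptions, not details from the paper.

MFLOPS_PER_CELL = 20  # per-cell peak rate quoted in the abstract


def peak_gflops(cells: int, mflops_per_cell: float = MFLOPS_PER_CELL) -> float:
    """Aggregate peak rate in GFLOPS, assuming perfect scaling across cells."""
    return cells * mflops_per_cell / 1000.0


def torus_neighbors(row: int, col: int, rows: int, cols: int) -> list[tuple[int, int]]:
    """Four neighbors of a cell in a 2D torus (assumed wraparound in both dimensions)."""
    return [
        ((row - 1) % rows, col),  # north, wraps to the last row
        ((row + 1) % rows, col),  # south
        (row, (col - 1) % cols),  # west, wraps to the last column
        (row, (col + 1) % cols),  # east
    ]


if __name__ == "__main__":
    print(peak_gflops(8 * 8))           # 64 cells in the demonstration system
    print(peak_gflops(1024))            # maximum configuration named in the abstract
    print(torus_neighbors(0, 0, 8, 8))  # corner cell still has four neighbors
```

Under these assumptions, 64 cells give 1.28 GFLOPS, consistent with the "more than 1.2 GFLOPS" stated above, and a 1,024-cell system would peak at 20.48 GFLOPS.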