WaveScalar

Authors:
Steven Swanson; Ken Michelson; Andrew Schwerin; Mark Oskin
Affiliations:
Computer Science and Engineering, University of Washington; Computer Science and Engineering, University of Washington; Computer Science and Engineering, University of Washington; Computer Science and Engineering, University of Washington
Venue:
Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
Year:
2003



Abstract

Silicon technology will continue to provide an exponential increase in the availability of raw transistors. Effectively translating this resource into application performance, however, is an open challenge. Ever increasing wire-delay relative to switching speed and the exponential cost of circuit complexity make simply scaling up existing processor designs futile. In this paper, we present an alternative to superscalar design, WaveScalar. WaveScalar is a dataflow instruction set architecture and execution model designed for scalable, low-complexity/high-performance processors. WaveScalar is unique among dataflow architectures in efficiently providing traditional memory semantics. At last, a dataflow machine can run "real-world" programs, written in any language, without sacrificing parallelism.

The WaveScalar ISA is designed to run on an intelligent memory system. Each instruction in a WaveScalar binary executes in place in the memory system and explicitly communicates with its dependents in dataflow fashion. WaveScalar architectures cache instructions and the values they operate on in a WaveCache, a simple grid of "alu-in-cache" nodes. By co-locating computation and data in physical space, the WaveCache minimizes long wire, high-latency communication.

This paper introduces the WaveScalar instruction set and evaluates a simulated implementation based on current technology. Results for the SPEC and Mediabench applications demonstrate that the WaveCache out-performs an aggressively configured superscalar design by 2-7 times, with ample opportunities for future optimizations.
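
The abstract's statement that each instruction "explicitly communicates with its dependents in dataflow fashion" can be illustrated with a minimal sketch of the generic dataflow firing rule: an instruction executes once all of its operands have arrived, then sends its result directly to the instructions that consume it. This is not the WaveScalar ISA or the WaveCache; the names `Node` and `run` are hypothetical, and the sketch leaves out memory ordering, instruction placement, and any notion of timing.

```python
from collections import deque


class Node:
    """One dataflow instruction (hypothetical model, not the WaveScalar encoding)."""

    def __init__(self, op, arity, dests):
        self.op = op          # function applied to the operands
        self.arity = arity    # number of operands needed before firing
        self.dests = dests    # list of (consumer name, input slot) pairs
        self.operands = {}    # operands received so far, keyed by slot


def run(nodes, tokens):
    """Deliver tokens until none remain; return results of nodes with no consumers."""
    queue = deque(tokens)     # each token is (node name, input slot, value)
    outputs = {}
    while queue:
        name, slot, value = queue.popleft()
        node = nodes[name]
        node.operands[slot] = value
        # Firing rule: execute only when every operand slot is filled.
        if len(node.operands) == node.arity:
            args = [node.operands[i] for i in range(node.arity)]
            node.operands = {}
            value_out = node.op(*args)
            if not node.dests:
                outputs[name] = value_out
            # Send the result straight to each dependent instruction.
            for dest, dest_slot in node.dests:
                queue.append((dest, dest_slot, value_out))
    return outputs


# Compute (a + b) * (a - b) for a = 7, b = 3 with no program counter.
graph = {
    "add": Node(lambda x, y: x + y, 2, [("mul", 0)]),
    "sub": Node(lambda x, y: x - y, 2, [("mul", 1)]),
    "mul": Node(lambda x, y: x * y, 2, []),
}
result = run(graph, [("add", 0, 7), ("add", 1, 3), ("sub", 0, 7), ("sub", 1, 3)])
print(result)  # {'mul': 40}
```

In this sketch the `add` and `sub` nodes have no dependence on each other, so the order in which their tokens are delivered does not change the result. That independence is the parallelism a dataflow machine can exploit.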