Area-Performance Trade-offs in Tiled Dataflow Architectures

Authors:
Steven Swanson;Andrew Putnam;Martha Mercaldi;Ken Michelson;Andrew Petersen;Andrew Schwerin;Mark Oskin;Susan J. Eggers
Affiliations:
University of Washington;University of Washington;University of Washington;University of Washington;University of Washington;University of Washington;University of Washington;University of Washington
Venue:
Proceedings of the 33rd annual international symposium on Computer Architecture
Year:
2006

Citing 22
Cited 10

The Manchester prototype dataflow computer

Communications of the ACM - Special section on computer architecture
The misconstrued semicolon: reconciling imperative languages and dataflow machines

The misconstrued semicolon: reconciling imperative languages and dataflow machines
Evaluation of a prototype data flow processor of the SIGMA-1 for scientific computations

ISCA '86 Proceedings of the 13th annual international symposium on Computer architecture
Deadlock-Free Message Routing in Multiprocessor Interconnection Networks

IEEE Transactions on Computers
An evaluation of directory schemes for cache coherence

ISCA '88 Proceedings of the 15th Annual International Symposium on Computer architecture
The Epsilon dataflow processor

ISCA '89 Proceedings of the 16th annual international symposium on Computer architecture
Executing a Program on the MIT Tagged-Token Dataflow Architecture

IEEE Transactions on Computers
Fine-grain parallelism with minimal hardware support: a compiler-controlled threaded abstract machine

ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
MediaBench: a tool for evaluating and synthesizing multimedia and communications systems

MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Space-time scheduling of instruction-level parallelism on a raw machine

Proceedings of the eighth international conference on Architectural support for programming languages and operating systems
Monsoon: an explicit token-store architecture

ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
Smart Memories: a modular reconfigurable architecture

Proceedings of the 27th annual international symposium on Computer architecture
A design space evaluation of grid processor architectures

Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
A preliminary architecture for a basic data-flow processor

ISCA '75 Proceedings of the 2nd annual symposium on Computer architecture
Exploring Optimal Cost-Performance Designs for Raw Microprocessors

FCCM '98 Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines
DDDP-a Distributed Data Driven Processor

ISCA '83 Proceedings of the 10th annual international symposium on Computer architecture
The architecture and system method of DDM1: A recursively structured Data Driven Machine

ISCA '78 Proceedings of the 5th annual symposium on Computer architecture
Exploiting ILP, TLP, and DLP with the polymorphous TRIPS architecture

Proceedings of the 30th annual international symposium on Computer architecture
WaveScalar

Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
Evaluation of the Raw Microprocessor: An Exposed-Wire-Delay Architecture for ILP and Streams

Proceedings of the 31st annual international symposium on Computer architecture
Performance/Watt: the new server focus

ACM SIGARCH Computer Architecture News - Special issue: dasCMP'05
The WaveScalar architecture

ACM Transactions on Computer Systems (TOCS)

Modeling instruction placement on a spatial architecture

Proceedings of the eighteenth annual ACM symposium on Parallelism in algorithms and architectures
Reducing control overhead in dataflow architectures

Proceedings of the 15th international conference on Parallel architectures and compilation techniques
Instruction scheduling for a tiled dataflow architecture

Proceedings of the 12th international conference on Architectural support for programming languages and operating systems
The WaveScalar architecture

ACM Transactions on Computer Systems (TOCS)
Implementation and Evaluation of a Dynamically Routed Processor Operand Network

NOCS '07 Proceedings of the First International Symposium on Networks-on-Chip
Alternative dataflow model

ACST'07 Proceedings of the third conference on IASTED International Conference: Advances in Computer Science and Technology
Chip multiprocessor based on data-driven multithreading model

International Journal of High Performance Systems Architecture
A case for FAME: FPGA architecture model execution

Proceedings of the 37th annual international symposium on Computer architecture
A dynamic dataflow architecture using partial reconfigurable hardware as an option for multiple cores

WSEAS Transactions on Computers
Lighting the dark silicon by exploiting heterogeneity on future processors

Proceedings of the 50th Annual Design Automation Conference

Abstract

Tiled architectures, such as RAW, SmartMemories, TRIPS, and WaveScalar, promise to address several issues facing conventional processors, including complexity, wire delay, and performance. The basic premise of these architectures is that larger, higher-performance implementations can be constructed by replicating the basic tile across the chip. This paper explores the area-performance trade-offs when designing one such tiled architecture, WaveScalar. We use a synthesizable RTL model and cycle-level simulator to perform an area/performance Pareto analysis of over 200 WaveScalar processor designs ranging in size from 19 mm² to 378 mm² and having a 22 FO4 cycle time. We demonstrate that, for multi-threaded workloads, WaveScalar performance scales almost ideally from 19 to 101 mm² when optimized for area efficiency and from 44 to 202 mm² when optimized for peak performance. Our analysis reveals that WaveScalar's hierarchical interconnect plays an important role in overall scalability, and that WaveScalar achieves the same (or higher) performance in substantially less area than either an aggressive out-of-order superscalar or Sun's Niagara CMP processor.
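
As an illustration of what an area/performance Pareto analysis involves, the sketch below filters a set of candidate designs down to those that no other design beats on both area and performance. It is a minimal example, not the authors' tooling: the function name `pareto_front`, the tuple layout, and every design point listed are hypothetical, and the performance figures are invented purely for demonstration.

```python
from typing import List, Tuple

# (name, area in mm^2, performance in arbitrary units); layout is an assumption.
Design = Tuple[str, float, float]


def pareto_front(designs: List[Design]) -> List[Design]:
    """Return the designs that are not dominated by any other design.

    A design is dominated when another design is no larger in area and no
    lower in performance, and strictly better in at least one of the two.
    Exact duplicates (same area and performance) are reported only once.
    """
    front: List[Design] = []
    best_perf = float("-inf")
    # Sweep from smallest to largest area; at equal area, try the
    # higher-performance design first so it wins the tie.
    for design in sorted(designs, key=lambda d: (d[1], -d[2])):
        if design[2] > best_perf:
            front.append(design)
            best_perf = design[2]
    return front


# Hypothetical design points (not data from the paper).
designs: List[Design] = [
    ("A", 19.0, 1.0),
    ("B", 44.0, 2.5),
    ("C", 50.0, 2.0),   # larger and slower than B, so dominated
    ("D", 101.0, 5.0),
    ("E", 120.0, 4.8),  # larger and slower than D, so dominated
    ("F", 202.0, 9.0),
]

for name, area, perf in pareto_front(designs):
    print(f"{name}: {area} mm^2, performance {perf}")
```

Sorting by area and keeping only designs that raise the best performance seen so far yields the frontier in a single pass after the sort, which is convenient when sweeping a few hundred configurations.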