A design space evaluation of grid processor architectures

Authors:
Ramadass Nagarajan;Karthikeyan Sankaralingam;Doug Burger;Stephen W. Keckler
Affiliations:
The University of Texas, Austin;The University of Texas, Austin;The University of Texas, Austin;The University of Texas, Austin
Venue:
Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
Year:
2001

Abstract

In this paper, we survey the design space of a new class of architectures called Grid Processor Architectures (GPAs). These architectures are designed to scale with technology, allowing faster clock rates than conventional architectures while providing superior instruction-level parallelism on traditional workloads and high performance across a range of application classes. A GPA consists of an array of ALUs, each with limited control, connected by a thin operand network. Programs are executed by mapping blocks of statically scheduled instructions to the ALU array and executing them dynamically in dataflow order. This organization enables the critical paths of instruction blocks to be executed on chains of ALUs without transmitting temporary values back to the register file, avoiding most of the large, unscalable structures that limit conventional architectures. Finally, we present simulation results for a preliminary design, the GPA-1. With a half-cycle routing delay, we obtain performance roughly equal to that of an ideal 8-way, 512-entry window superscalar core. With no inter-ALU delay, perfect memory, and perfect branch prediction, the IPC of the GPA-1 is more than twice that of the ideal superscalar core, achieving an average of 11 IPC across nine SPEC CPU2000 and MediaBench benchmarks.
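
The execution model described in the abstract (a statically placed block of instructions that fires in dataflow order, with operands forwarded ALU to ALU) can be illustrated with a toy timing model. This is a minimal sketch and not the GPA-1 simulator: the names `Instr`, `run_block`, `hop_delay`, and `alu_latency` are hypothetical, block inputs are assumed to be available at every ALU at time 0, routing cost is assumed to be Manhattan distance times a per-hop delay, and ALU contention, network bandwidth, memory, and branches are not modeled.

```python
from dataclasses import dataclass
import operator

# Hypothetical two-operand opcodes, for illustration only.
OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}


@dataclass(frozen=True)
class Instr:
    name: str     # name of the value this instruction produces
    op: str       # key into OPS
    srcs: tuple   # producer names: other instructions or block inputs
    slot: tuple   # (row, col) of the ALU picked by the static scheduler


def hops(a, b):
    """Manhattan distance between two ALU slots (assumed routing cost)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def run_block(instrs, inputs, hop_delay=0.5, alu_latency=1.0):
    """Fire each instruction once all of its operands have arrived.

    Returns (values, done): each name's result and the cycle at which
    that result becomes available at its producing ALU.
    """
    values = dict(inputs)
    done = {name: 0.0 for name in inputs}   # assumption: inputs ready at t=0
    slot_of = {}                            # block inputs have no slot
    pending = list(instrs)
    while pending:
        ready = [i for i in pending if all(s in values for s in i.srcs)]
        if not ready:
            raise ValueError("block has an unsatisfiable dependence")
        for ins in ready:
            # An operand arrives after travelling from its producer's ALU.
            arrivals = [
                done[s]
                + (hops(slot_of[s], ins.slot) * hop_delay if s in slot_of else 0.0)
                for s in ins.srcs
            ]
            start = max(arrivals, default=0.0)
            values[ins.name] = OPS[ins.op](*(values[s] for s in ins.srcs))
            done[ins.name] = start + alu_latency
            slot_of[ins.name] = ins.slot
            pending.remove(ins)
    return values, done


# Block computing (a + b) * (c - d), placed on a 2x2 corner of the grid.
block = [
    Instr("t1", "add", ("a", "b"), (0, 0)),
    Instr("t2", "sub", ("c", "d"), (0, 1)),
    Instr("t3", "mul", ("t1", "t2"), (1, 0)),
]
inputs = {"a": 1, "b": 2, "c": 7, "d": 3}

values, done = run_block(block, inputs, hop_delay=0.5)
values_ideal, done_ideal = run_block(block, inputs, hop_delay=0.0)
```

In this sketch the temporaries `t1` and `t2` travel directly to the ALU holding `t3` and never pass through a register file. Changing `hop_delay` or the `slot` assignments shifts the completion time of the block, which is the kind of placement and routing-delay sensitivity the abstract's half-cycle and zero-delay results refer to.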