Evaluation of the Raw Microprocessor: An Exposed-Wire-Delay Architecture for ILP and Streams

Authors:
Michael Bedford Taylor;Walter Lee;Jason Miller;David Wentzlaff;Ian Bratt;Ben Greenwald;Henry Hoffmann;Paul Johnson;Jason Kim;James Psota;Arvind Saraf;Nathan Shnidman;Volker Strumpen;Matt Frank;Saman Amarasinghe;Anant Agarwal
Affiliations:
CSAIL, Massachusetts Institute of Technology;CSAIL, Massachusetts Institute of Technology;CSAIL, Massachusetts Institute of Technology;CSAIL, Massachusetts Institute of Technology;CSAIL, Massachusetts Institute of Technology;CSAIL, Massachusetts Institute of Technology;CSAIL, Massachusetts Institute of Technology;CSAIL, Massachusetts Institute of Technology;CSAIL, Massachusetts Institute of Technology;CSAIL, Massachusetts Institute of Technology;CSAIL, Massachusetts Institute of Technology;CSAIL, Massachusetts Institute of Technology;CSAIL, Massachusetts Institute of Technology;CSAIL, Massachusetts Institute of Technology;CSAIL, Massachusetts Institute of Technology;CSAIL, Massachusetts Institute of Technology
Venue:
Proceedings of the 31st annual international symposium on Computer architecture
Year:
2004


Abstract

This paper evaluates the Raw microprocessor. Raw addresses the challenge of building a general-purpose architecture that performs well on a larger class of stream and embedded computing applications than existing microprocessors, while still running existing ILP-based sequential programs with reasonable performance in the face of increasing wire delays. Raw approaches this challenge by implementing plenty of on-chip resources - including logic, wires, and pins - in a tiled arrangement, and exposing them through a new ISA, so that the software can take advantage of these resources for parallel applications. Raw supports both ILP and streams by routing operands between architecturally-exposed functional units over a point-to-point scalar operand network. This network offers low latency for scalar data transport. Raw manages the effect of wire delays by exposing the interconnect and using software to orchestrate both scalar and stream data transport.

We have implemented a prototype Raw microprocessor in IBM's 180 nm, 6-layer copper, CMOS 7SF standard-cell ASIC process. We have also implemented ILP and stream compilers. Our evaluation attempts to determine the extent to which Raw succeeds in meeting its goal of serving as a more versatile, general-purpose processor. Central to achieving this goal is Raw's ability to exploit all forms of parallelism, including ILP, DLP, TLP, and Stream parallelism. Specifically, we evaluate the performance of Raw on a diverse set of codes including traditional sequential programs, streaming applications, server workloads and bit-level embedded computation. Our experimental methodology makes use of a cycle-accurate simulator validated against our real hardware. Compared to a 180 nm Pentium-III, using commodity PC memory system components, Raw performs within a factor of 2x for sequential applications with a very low degree of ILP, about 2x to 9x better for higher levels of ILP, and 10x-100x better when highly parallel applications are coded in a stream language or optimized by hand. The paper also proposes a new versatility metric and uses it to discuss the generality of Raw.