This paper evaluates the Raw microprocessor. Raw addresses the challenge of building a general-purpose architecture that performs well on a larger class of stream and embedded computing applications than existing microprocessors, while still running existing ILP-based sequential programs with reasonable performance in the face of increasing wire delays. Raw approaches this challenge by implementing plenty of on-chip resources - including logic, wires, and pins - in a tiled arrangement, and exposing them through a new ISA, so that the software can take advantage of these resources for parallel applications. Raw supports both ILP and streams by routing operands between architecturally-exposed functional units over a point-to-point scalar operand network. This network offers low latency for scalar data transport. Raw manages the effect of wire delays by exposing the interconnect and using software to orchestrate both scalar and stream data transport.

We have implemented a prototype Raw microprocessor in IBM's 180 nm, 6-layer copper, CMOS 7SF standard-cell ASIC process. We have also implemented ILP and stream compilers. Our evaluation attempts to determine the extent to which Raw succeeds in meeting its goal of serving as a more versatile, general-purpose processor. Central to achieving this goal is Raw's ability to exploit all forms of parallelism, including ILP, DLP, TLP, and Stream parallelism. Specifically, we evaluate the performance of Raw on a diverse set of codes including traditional sequential programs, streaming applications, server workloads and bit-level embedded computation. Our experimental methodology makes use of a cycle-accurate simulator validated against our real hardware. Compared to a 180 nm Pentium-III, using commodity PC memory system components, Raw performs within a factor of 2x for sequential applications with a very low degree of ILP, about 2x to 9x better for higher levels of ILP, and 10x-100x better when highly parallel applications are coded in a stream language or optimized by hand. The paper also proposes a new versatility metric and uses it to discuss the generality of Raw.