Smart Memories: a modular reconfigurable architecture

Authors:
Ken Mai;Tim Paaske;Nuwan Jayasena;Ron Ho;William J. Dally;Mark Horowitz
Affiliations:
Computer Systems Laboratory, Stanford University, Stanford, California;Computer Systems Laboratory, Stanford University, Stanford, California;Computer Systems Laboratory, Stanford University, Stanford, California;Computer Systems Laboratory, Stanford University, Stanford, California;Computer Systems Laboratory, Stanford University, Stanford, California;Computer Systems Laboratory, Stanford University, Stanford, California
Venue:
Proceedings of the 27th annual international symposium on Computer architecture
Year:
2000

Abstract

Trends in VLSI technology scaling demand that future computing devices be narrowly focused to achieve high performance and high efficiency, yet also target the high volumes and low costs of widely applicable general purpose designs. To address these conflicting requirements, we propose a modular reconfigurable architecture called Smart Memories, targeted at computing needs in the 0.1 μm technology generation. A Smart Memories chip is made up of many processing tiles, each containing local memory, local interconnect, and a processor core. For efficient computation under a wide class of possible applications, the memories, the wires, and the computational model can all be altered to match the applications. To show the applicability of this design, two very different machines at opposite ends of the architectural spectrum, the Imagine stream processor and the Hydra speculative multiprocessor, are mapped onto the Smart Memories computing substrate. Simulations of the mappings show that the Smart Memories architecture can successfully map these architectures with only modest performance degradation.