The Manchester prototype dataflow computer
Communications of the ACM - Special section on computer architecture
Evaluation of a prototype data flow processor of the SIGMA-1 for scientific computations
ISCA '86 Proceedings of the 13th annual international symposium on Computer architecture
Deadlock-Free Message Routing in Multiprocessor Interconnection Networks
IEEE Transactions on Computers
Efficient synchronization primitives for large-scale cache-coherent multiprocessors
ASPLOS III Proceedings of the third international conference on Architectural support for programming languages and operating systems
I-structures: data structures for parallel computing
ACM Transactions on Programming Languages and Systems (TOPLAS)
The Epsilon dataflow processor
ISCA '89 Proceedings of the 16th annual international symposium on Computer architecture
ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
Journal of Parallel and Distributed Computing - Special issue on shared-memory multiprocessors
Efficiently computing static single assignment form and the control dependence graph
ACM Transactions on Programming Languages and Systems (TOPLAS)
Multithreading: a revisionist view of dataflow architectures
ISCA '91 Proceedings of the 18th annual international symposium on Computer architecture
Effective compiler support for predicated execution using the hyperblock
MICRO 25 Proceedings of the 25th annual international symposium on Microarchitecture
Pthreads programming
Converting thread-level parallelism to instruction-level parallelism via simultaneous multithreading
ACM Transactions on Computer Systems (TOCS)
MediaBench: a tool for evaluating and synthesizing multimedia and communicatons systems
MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Memory dependence prediction using store sets
Proceedings of the 25th annual international symposium on Computer architecture
Exploiting fine-grain thread level parallelism on the MIT multi-ALU processor
Proceedings of the 25th annual international symposium on Computer architecture
Space-time scheduling of instruction-level parallelism on a raw machine
Proceedings of the eighth international conference on Architectural support for programming languages and operating systems
Monsoon: an explicit token-store architecture
ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
Smart Memories: a modular reconfigurable architecture
Proceedings of the 27th annual international symposium on Computer architecture
Clock rate versus IPC: the end of the road for conventional microarchitectures
Proceedings of the 27th annual international symposium on Computer architecture
Piranha: a scalable architecture based on single-chip multiprocessing
Proceedings of the 27th annual international symposium on Computer architecture
NanoFabrics: spatial computing using molecular electronics
ISCA '01 Proceedings of the 28th annual international symposium on Computer architecture
The optimal logic depth per pipeline stage is 6 to 8 FO4 inverter delays
ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
A design space evaluation of grid processor architectures
Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
IEEE Micro
A preliminary architecture for a basic data-flow processor
ISCA '75 Proceedings of the 2nd annual symposium on Computer architecture
DDDP-a Distributed Data Driven Processor
ISCA '83 Proceedings of the 10th annual international symposium on Computer architecture
The architecture and system method of DDM1: A recursively structured Data Driven Machine
ISCA '78 Proceedings of the 5th annual symposium on Computer architecture
Supporting Fine-Grained Synchronization on a Simultaneous Multithreading Processor
HPCA '99 Proceedings of the 5th International Symposium on High Performance Computer Architecture
Exploiting ILP, TLP, and DLP with the polymorphous TRIPS architecture
Proceedings of the 30th annual international symposium on Computer architecture
Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
Evaluation of the Raw Microprocessor: An Exposed-Wire-Delay Architecture for ILP and Streams
Proceedings of the 31st annual international symposium on Computer architecture
ASPLOS XI Proceedings of the 11th international conference on Architectural support for programming languages and operating systems
Compiling for EDGE Architectures
Proceedings of the International Symposium on Code Generation and Optimization
Area-Performance Trade-offs in Tiled Dataflow Architectures
Proceedings of the 33rd annual international symposium on Computer Architecture
Modeling instruction placement on a spatial architecture
Proceedings of the eighteenth annual ACM symposium on Parallelism in algorithms and architectures
Reducing control overhead in dataflow architectures
Proceedings of the 15th international conference on Parallel architectures and compilation techniques
Instruction scheduling for a tiled dataflow architecture
Proceedings of the 12th international conference on Architectural support for programming languages and operating systems
The wavescalar architecture
Area-Performance Trade-offs in Tiled Dataflow Architectures
Proceedings of the 33rd annual international symposium on Computer Architecture
Design and implementation of a queue compiler
Microprocessors & Microsystems
Compiling Techniques for Coarse Grained Runtime Reconfigurable Architectures
ARC '09 Proceedings of the 5th International Workshop on Reconfigurable Computing: Architectures, Tools and Applications
REDEFINE: Runtime reconfigurable polymorphic ASIC
ACM Transactions on Embedded Computing Systems (TECS)
Using a configurable processor generator for computer architecture prototyping
Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
Conservation cores: reducing the energy of mature computations
Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems
Strategies for compiling µTC to novel chip Multiprocessors
SAMOS'07 Proceedings of the 7th international conference on Embedded computer systems: architectures, modeling, and simulation
RETHROTTLE: execution throttling in the REDEFINE SoC architecture
SAMOS'09 Proceedings of the 9th international conference on Systems, architectures, modeling and simulation
WSEAS Transactions on Computers
Triggered instructions: a control paradigm for spatially-programmed architectures
Proceedings of the 40th Annual International Symposium on Computer Architecture
HARS: A hardware-assisted runtime software for embedded many-core architectures
ACM Transactions on Embedded Computing Systems (TECS) - Special Issue on Design Challenges for Many-Core Processors, Special Section on ESTIMedia'13 and Regular Papers
Q100: the architecture and design of a database processing unit
Proceedings of the 19th international conference on Architectural support for programming languages and operating systems
Hi-index | 0.00 |
Silicon technology will continue to provide an exponential increase in the availability of raw transistors. Effectively translating this resource into application performance, however, is an open challenge that conventional superscalar designs will not be able to meet. We present WaveScalar as a scalable alternative to conventional designs. WaveScalar is a dataflow instruction set and execution model designed for scalable, low-complexity/high-performance processors. Unlike previous dataflow machines, WaveScalar can efficiently provide the sequential memory semantics that imperative languages require. To allow programmers to easily express parallelism, WaveScalar supports pthread-style, coarse-grain multithreading and dataflow-style, fine-grain threading. In addition, it permits blending the two styles within an application, or even a single function. To execute WaveScalar programs, we have designed a scalable, tile-based processor architecture called the WaveCache. As a program executes, the WaveCache maps the program's instructions onto its array of processing elements (PEs). The instructions remain at their processing elements for many invocations, and as the working set of instructions changes, the WaveCache removes unused instructions and maps new ones in their place. The instructions communicate directly with one another over a scalable, hierarchical on-chip interconnect, obviating the need for long wires and broadcast communication. This article presents the WaveScalar instruction set and evaluates a simulated implementation based on current technology. For single-threaded applications, the WaveCache achieves performance on par with conventional processors, but in less area. For coarse-grain threaded applications the WaveCache achieves nearly linear speedup with up to 64 threads and can sustain 7--14 multiply-accumulates per cycle on fine-grain threaded versions of well-known kernels. Finally, we apply both styles of threading to equake from Spec2000 and speed it up by 9x compared to the serial version.