Space-time scheduling of instruction-level parallelism on a raw machine

Authors:
Walter Lee;Rajeev Barua;Matthew Frank;Devabhaktuni Srikrishna;Jonathan Babb;Vivek Sarkar;Saman Amarasinghe
Affiliations:
M.I.T. Laboratory for Computer Science;M.I.T. Laboratory for Computer Science;M.I.T. Laboratory for Computer Science;M.I.T. Laboratory for Computer Science;M.I.T. Laboratory for Computer Science;M.I.T. Laboratory for Computer Science;M.I.T. Laboratory for Computer Science
Venue:
Proceedings of the eighth international conference on Architectural support for programming languages and operating systems
Year:
1998

Abstract

Increasing demand for both greater parallelism and faster clocks dictates that future-generation architectures will need to decentralize their resources and eliminate primitives that require single-cycle global communication. A Raw microprocessor distributes all of its resources, including instruction streams, register files, memory ports, and ALUs, over a pipelined two-dimensional mesh interconnect, and exposes them fully to the compiler. Because communication in Raw machines is distributed, compiling for instruction-level parallelism (ILP) requires both spatial instruction partitioning and traditional temporal instruction scheduling. In addition, the compiler must explicitly manage all communication through the interconnect, including the global synchronization required at branch points. This paper describes RAWCC, the compiler we have developed for compiling general-purpose sequential programs to the distributed Raw architecture. We present performance results demonstrating that, although Raw machines provide no mechanisms for global communication, the Raw compiler can schedule code to achieve speedups that scale with the number of available functional units.
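To make the idea of combined spatial partitioning and temporal scheduling concrete, the following is a minimal illustrative sketch. It is not RAWCC's actual algorithm, which the abstract does not detail. It greedily assigns each instruction of a dependence graph to a tile of a two-dimensional mesh and to an issue cycle on that tile. All names (`Placement`, `space_time_schedule`, `makespan`) are hypothetical, and the cost model is an assumption: every instruction takes one cycle, each tile issues at most one instruction per cycle, and an operand needs `hop_latency` cycles per mesh hop to reach another tile.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Tile = Tuple[int, int]  # (row, col) position in the mesh


@dataclass(frozen=True)
class Placement:
    tile: Tile   # where the instruction executes (the "space" decision)
    cycle: int   # when it issues on that tile (the "time" decision)


def hops(a: Tile, b: Tile) -> int:
    """Manhattan distance between two tiles of the mesh."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def space_time_schedule(deps: Dict[str, List[str]], rows: int, cols: int,
                        hop_latency: int = 1) -> Dict[str, Placement]:
    """Greedy placement. `deps` maps each instruction to its predecessors
    and must list instructions in topological order (assumed input format)."""
    tiles = [(r, c) for r in range(rows) for c in range(cols)]
    tile_free = {t: 0 for t in tiles}      # next free issue cycle per tile
    placed: Dict[str, Placement] = {}
    for instr, preds in deps.items():
        best = None
        for t in tiles:
            # An operand is available one cycle after its producer issues,
            # plus the assumed communication delay across the mesh.
            ready = 0
            for p in preds:
                src = placed[p]
                arrival = src.cycle + 1 + hop_latency * hops(src.tile, t)
                ready = max(ready, arrival)
            start = max(ready, tile_free[t])
            # Keep the earliest start; ties go to the first tile examined.
            if best is None or start < best.cycle:
                best = Placement(t, start)
        placed[instr] = best
        tile_free[best.tile] = best.cycle + 1
    return placed


def makespan(placed: Dict[str, Placement]) -> int:
    """Total cycles needed: last issue cycle plus one."""
    return max(p.cycle for p in placed.values()) + 1
```

For example, with `deps = {"a": [], "b": [], "c": ["a", "b"], "d": []}`, calling `space_time_schedule(deps, 1, 1)` serializes all four instructions on one tile, while `space_time_schedule(deps, 1, 2)` spreads the independent instructions across two tiles and pays one hop of latency to bring an operand of `c` across the mesh. A greedy heuristic like this only illustrates the trade-off between parallelism and communication cost; it makes no claim about the quality of the resulting schedule.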