Software pipelining: an effective scheduling technique for VLIW machines
PLDI '88 Proceedings of the ACM SIGPLAN 1988 conference on Programming Language design and Implementation
Limits of control flow on parallelism
ISCA '92 Proceedings of the 19th annual international symposium on Computer architecture
Partitioned register files for VLIWs: a preliminary analysis of tradeoffs
MICRO 25 Proceedings of the 25th annual international symposium on Microarchitecture
The multiflow trace scheduling compiler
The Journal of Supercomputing - Special issue on instruction-level parallelism
The superblock: an effective technique for VLIW and superscalar compilation
The Journal of Supercomputing - Special issue on instruction-level parallelism
List scheduling with and without communication delays
Parallel Computing
SUIF: an infrastructure for research on parallelizing and optimizing compilers
ACM SIGPLAN Notices
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
A framework for balancing control flow and predication
MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Partitioning and Scheduling Parallel Programs for Multiprocessors
Partitioning and Scheduling Parallel Programs for Multiprocessors
Multiprocessors from a Software Perspective
IEEE Micro
DSC: Scheduling Parallel Tasks on an Unbounded Number of Processors
IEEE Transactions on Parallel and Distributed Systems
The RAW benchmark suite: computation structures for general purpose computing
FCCM '97 Proceedings of the 5th IEEE Symposium on FPGA-Based Custom Computing Machines
R. Barua, W. Lee, S. Amarasinghe and A. Agarwal
HIPC '98 Proceedings of the Fifth International Conference on High Performance Computing
Analysis, evaluation, and comparison of algorithms for scheduling task graphs on parallel processors
ISPAN '96 Proceedings of the 1996 International Symposium on Parallel Architectures, Algorithms and Networks
Maps: a Compiler-Managed Memory System for RAW Machines
Maps: a Compiler-Managed Memory System for RAW Machines
Bulldog: a compiler for vliw architectures (parallel computing, reduced-instruction-set, trace scheduling, scientific)
Instruction scheduling and fetch mechanisms for clustered vliw processors
Instruction scheduling and fetch mechanisms for clustered vliw processors
Logic emulation with virtual wires
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
MICRO 31 Proceedings of the 31st annual ACM/IEEE international symposium on Microarchitecture
Maps: a compiler-managed memory system for raw machines
ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
Exploiting ILP in page-based intelligent memory
Proceedings of the 32nd annual ACM/IEEE international symposium on Microarchitecture
A C compiler for a processor with a reconfigurable functional unit
FPGA '00 Proceedings of the 2000 ACM/SIGDA eighth international symposium on Field programmable gate arrays
CHIMAERA: a high-performance architecture with a tightly-coupled reconfigurable functional unit
Proceedings of the 27th annual international symposium on Computer architecture
Bidwidth analysis with application to silicon compilation
PLDI '00 Proceedings of the ACM SIGPLAN 2000 conference on Programming language design and implementation
Attacking the semantic gap between application programming languages and configurable hardware
FPGA '01 Proceedings of the 2001 ACM/SIGDA ninth international symposium on Field programmable gate arrays
A decade of reconfigurable computing: a visionary retrospective
Proceedings of the conference on Design, automation and test in Europe
ASP-DAC '00 Proceedings of the 2000 Asia and South Pacific Design Automation Conference
Coarse grain reconfigurable architecture (embedded tutorial)
Proceedings of the 2001 Asia and South Pacific Design Automation Conference
A framework for reconfigurable computing: task scheduling and context management
IEEE Transactions on Very Large Scale Integration (VLSI) Systems - System Level Design
Compiler Support for Scalable and Efficient Memory Systems
IEEE Transactions on Computers
An instruction set and microarchitecture for instruction level distributed processing
ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
Direct addressed caches for reduced power consumption
Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
A stream compiler for communication-exposed architectures
Proceedings of the 10th international conference on Architectural support for programming languages and operating systems
Adaptive Multiuser Online Reconfigurable Engine
IEEE Design & Test
Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture
Parallelizing Applications into Silicon
FCCM '99 Proceedings of the Seventh Annual IEEE Symposium on Field-Programmable Custom Computing Machines
Scalar Operand Networks: On-Chip Interconnect for ILP in Partitioned Architectures
HPCA '03 Proceedings of the 9th International Symposium on High-Performance Computer Architecture
Reducing Cost and Tolerating Defects in Page-based Intelligent Memory
ICCD '00 Proceedings of the 2000 IEEE International Conference on Computer Design: VLSI in Computers & Processors
Energy characterization of a tiled architecture processor with on-chip networks
Proceedings of the 2003 international symposium on Low power electronics and design
Cluster assignment of global values for clustered VLIW processors
Proceedings of the 2003 international conference on Compilers, architecture and synthesis for embedded systems
Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
Network Topology Exploration of Mesh-Based Coarse-Grain Reconfigurable Architectures
Proceedings of the conference on Design, automation and test in Europe - Volume 1
Evaluation of the Raw Microprocessor: An Exposed-Wire-Delay Architecture for ILP and Streams
Proceedings of the 31st annual international symposium on Computer architecture
DATE '03 Proceedings of the conference on Design, Automation and Test in Europe - Volume 1
High-level power analysis for on-chip networks
Proceedings of the 2004 international conference on Compilers, architecture, and synthesis for embedded systems
ASPLOS XI Proceedings of the 11th international conference on Architectural support for programming languages and operating systems
An architecture and compiler for scalable on-chip communication
IEEE Transactions on Very Large Scale Integration (VLSI) Systems
IEEE Transactions on Parallel and Distributed Systems
Software-directed power-aware interconnection networks
Proceedings of the 2005 international conference on Compilers, architectures and synthesis for embedded systems
Automatic Thread Extraction with Decoupled Software Pipelining
Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture
Compiler-directed Data Partitioning for Multicluster Processors
Proceedings of the International Symposium on Code Generation and Optimization
Area-Performance Trade-offs in Tiled Dataflow Architectures
Proceedings of the 33rd annual international symposium on Computer Architecture
Modeling instruction placement on a spatial architecture
Proceedings of the eighteenth annual ACM symposium on Parallelism in algorithms and architectures
Instruction scheduling for a tiled dataflow architecture
Proceedings of the 12th international conference on Architectural support for programming languages and operating systems
Modulo graph embedding: mapping applications onto coarse-grained reconfigurable architectures
CASES '06 Proceedings of the 2006 international conference on Compilers, architecture and synthesis for embedded systems
Impact of intercluster communication mechanisms on ILP in clustered VLIW architectures
ACM Transactions on Design Automation of Electronic Systems (TODAES)
Software-directed power-aware interconnection networks
ACM Transactions on Architecture and Code Optimization (TACO)
ACM Transactions on Computer Systems (TOCS)
Inter-cluster communication in VLIW architectures
ACM Transactions on Architecture and Code Optimization (TACO)
Application driven embedded system design: a face recognition case study
CASES '07 Proceedings of the 2007 international conference on Compilers, architecture, and synthesis for embedded systems
IEEE Transactions on Very Large Scale Integration (VLSI) Systems
Data locality enhancement for CMPs
Proceedings of the 2007 IEEE/ACM international conference on Computer-aided design
Communication optimizations for global multi-threaded instruction scheduling
Proceedings of the 13th international conference on Architectural support for programming languages and operating systems
Software-directed combined cpu/link voltage scaling fornoc-based cmps
SIGMETRICS '08 Proceedings of the 2008 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Application mapping for chip multiprocessors
Proceedings of the 45th annual Design Automation Conference
Edge-centric modulo scheduling for coarse-grained reconfigurable architectures
Proceedings of the 17th international conference on Parallel architectures and compilation techniques
The StageNet fabric for constructing resilient multicore systems
Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture
Reconfigurable Computing: The Theory and Practice of FPGA-Based Computation
Reconfigurable Computing: The Theory and Practice of FPGA-Based Computation
What the parallel-processing community has (failed) to offer the multi/many-core generation
Journal of Parallel and Distributed Computing
CGRA express: accelerating execution using dynamic operation fusion
CASES '09 Proceedings of the 2009 international conference on Compilers, architecture, and synthesis for embedded systems
Modern development methods and tools for embedded reconfigurable systems: A survey
Integration, the VLSI Journal
Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
Compiling for reconfigurable computing: A survey
ACM Computing Surveys (CSUR)
Compiler directed network-on-chip reliability enhancement for chip multiprocessors
Proceedings of the ACM SIGPLAN/SIGBED 2010 conference on Languages, compilers, and tools for embedded systems
rMPI: message passing on multicore processors with on-chip interconnect
HiPEAC'08 Proceedings of the 3rd international conference on High performance embedded architectures and compilers
Partitioning composite web services for decentralized execution using a genetic algorithm
Future Generation Computer Systems
Resource recycling: putting idle resources to work on a composable accelerator
CASES '10 Proceedings of the 2010 international conference on Compilers, architectures and synthesis for embedded systems
Binary acceleration using coarse-grained reconfigurable architecture
ACM SIGARCH Computer Architecture News
A scheduling approach for distributed resource architectures with scarce communication resources
International Journal of High Performance Systems Architecture
A pattern for efficient parallel computation on multicore processors with scalar operand networks
Proceedings of the 2010 Workshop on Parallel Programming Patterns
A graph drawing based spatial mapping algorithm for coarse-grained reconfigurable architectures
IEEE Transactions on Very Large Scale Integration (VLSI) Systems
Kremlin: rethinking and rebooting gprof for the multicore age
Proceedings of the 32nd ACM SIGPLAN conference on Programming language design and implementation
CRIB: consolidated rename, issue, and bypass
Proceedings of the 38th annual international symposium on Computer architecture
Kismet: parallel speedup estimates for serial programs
Proceedings of the 2011 ACM international conference on Object oriented programming systems languages and applications
ARC'10 Proceedings of the 6th international conference on Reconfigurable Computing: architectures, Tools and Applications
Synchroscalar: initial lessons in power-aware design of a tile-based embedded architecture
PACS'03 Proceedings of the Third international conference on Power - Aware Computer Systems
Memory-Aware application mapping on coarse-grained reconfigurable arrays
HiPEAC'10 Proceedings of the 5th international conference on High Performance Embedded Architectures and Compilers
A systematic approach to classify design-time global scheduling techniques
ACM Computing Surveys (CSUR)
A general constraint-centric scheduling framework for spatial architectures
Proceedings of the 34th ACM SIGPLAN conference on Programming language design and implementation
Constraint centric scheduling guide
ACM SIGARCH Computer Architecture News
The von Neumann architecture is due for retirement
HotOS'13 Proceedings of the 14th USENIX conference on Hot Topics in Operating Systems
ACM Transactions on Embedded Computing Systems (TECS)
CAeSaR: unified cluster-assignment scheduling and communication reuse for clustered VLIW processors
Proceedings of the 2013 International Conference on Compilers, Architectures and Synthesis for Embedded Systems
Integration, the VLSI Journal
Hi-index | 0.00 |
Increasing demand for both greater parallelism and faster clocks dictate that future generation architectures will need to decentralize their resources and eliminate primitives that require single cycle global communication. A Raw microprocessor distributes all of its resources, including instruction streams, register files, memory ports, and ALUs, over a pipelined two-dimensional mesh interconnect, and exposes them fully to the compiler. Because communication in Raw machines is distributed, compiling for instruction-level parallelism (ILP) requires both spatial instruction partitioning as well as traditional temporal instruction scheduling. In addition, the compiler must explicitly manage all communication through the interconnect, including the global synchronization required at branch points. This paper describes RAWCC, the compiler we have developed for compiling general-purpose sequential programs to the distributed Raw architecture. We present performance results that demonstrate that although Raw machines provide no mechanisms for global communication the Raw compiler can schedule to achieve speedups that scale with the number of available functional units.