Software pipelining: an effective scheduling technique for VLIW machines

Authors:
M. Lam
Affiliations:
Carnegie Mellon Univ., Pittsburgh, PA
Venue:
PLDI '88 Proceedings of the ACM SIGPLAN 1988 conference on Programming language design and implementation
Year:
1988

References: 20
Cited by: 360

Compilation for a high-performance systolic array

SIGPLAN '86 Proceedings of the 1986 SIGPLAN symposium on Compiler construction
Highly concurrent scalar processing

Highly concurrent scalar processing
URPR—An extension of URCR for software pipelining

MICRO 19 Proceedings of the 19th annual workshop on Microprogramming
A study of scalar compilation techniques for pipelined supercomputers

ASPLOS II Proceedings of the second international conference on Architectural support for programming languages and operating systems
A VLIW architecture for a trace scheduling compiler

ASPLOS II Proceedings of the second international conference on Architectural support for programming languages and operating systems
The warp computer: Architecture, implementation, and performance

IEEE Transactions on Computers
Compiler optimizations for asynchronous systolic array programs

POPL '88 Proceedings of the 15th ACM SIGPLAN-SIGACT symposium on Principles of programming languages
A compilation technique for software pipelining of loops with conditional jumps

MICRO 20 Proceedings of the 20th annual workshop on Microprogramming
GURPR—a method for global software pipelining

MICRO 20 Proceedings of the 20th annual workshop on Microprogramming
Algorithm 97: Shortest path

Communications of the ACM
Parallel processing: a smart compiler and a dumb machine

SIGPLAN '84 Proceedings of the 1984 SIGPLAN symposium on Compiler construction
A Fortran compiler for the FPS-164 scientific computer

SIGPLAN '84 Proceedings of the 1984 SIGPLAN symposium on Compiler construction
Dependence graphs and compiler optimizations

POPL '81 Proceedings of the 8th ACM SIGPLAN-SIGACT symposium on Principles of programming languages
Computers and Intractability: A Guide to the Theory of NP-Completeness

Computers and Intractability: A Guide to the Theory of NP-Completeness
Some scheduling techniques and an easily schedulable horizontal architecture for high performance scientific computing

MICRO 14 Proceedings of the 14th annual workshop on Microprogramming
Global optimization of microprograms through modular control constructs

MICRO 12 Proceedings of the 12th annual workshop on Microprogramming
Improving the throughput of a pipeline by insertion of delays

ISCA '76 Proceedings of the 3rd annual symposium on Computer architecture
An improvement of trace scheduling for global microcode compaction

MICRO 17 Proceedings of the 17th annual workshop on Microprogramming
The optimization of horizontal microcode within and beyond basic blocks: an application of processor scheduling with resources

The optimization of horizontal microcode within and beyond basic blocks: an application of processor scheduling with resources
Bulldog: a compiler for vliw architectures (parallel computing, reduced-instruction-set, trace scheduling, scientific)

Bulldog: a compiler for vliw architectures (parallel computing, reduced-instruction-set, trace scheduling, scientific)

Warp: an integrated solution of high-speed parallel computing

Proceedings of the 1988 ACM/IEEE conference on Supercomputing
Architecture and compiler tradeoffs for a long instruction word processor

ASPLOS III Proceedings of the third international conference on Architectural support for programming languages and operating systems
Available instruction-level parallelism for superscalar and superpipelined machines

ASPLOS III Proceedings of the third international conference on Architectural support for programming languages and operating systems
SIMP (Single Instruction stream/Multiple instruction Pipelining): a novel high-speed single-processor architecture

ISCA '89 Proceedings of the 16th annual international symposium on Computer architecture
Cost-effective design of application specific VLIW processors using the SCARCE framework

MICRO 22 Proceedings of the 22nd annual workshop on Microprogramming and microarchitecture
On optimal loop parallelization

MICRO 22 Proceedings of the 22nd annual workshop on Microprogramming and microarchitecture
A study of scalar compilation techniques for pipelined supercomputers

ACM Transactions on Mathematical Software (TOMS)
Automatic transformation of series expressions into loops

ACM Transactions on Programming Languages and Systems (TOPLAS)
The floating point performance of a superscalar SPARC processor

ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
Limits of instruction-level parallelism

ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
Mapping concurrent programs to VLIW processors

PPOPP '91 Proceedings of the third ACM SIGPLAN symposium on Principles and practice of parallel programming
Parallelization of loops with exits on pipelined architectures

Proceedings of the 1990 ACM/IEEE conference on Supercomputing
A timed Petri-net model for fine-grain loop scheduling

PLDI '91 Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation
Circular scheduling: a new technique to perform software pipelining

PLDI '91 Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation
Global instruction scheduling for superscalar machines

PLDI '91 Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation
OHMEGA: a VLSI superscalar processor architecture for numerical applications

ISCA '91 Proceedings of the 18th annual international symposium on Computer architecture
IMPACT: an architectural framework for multiple-instruction-issue processors

ISCA '91 Proceedings of the 18th annual international symposium on Computer architecture
DSNS (dynamically-hazard-resolved statically-code-scheduled, nonuniform superscalar): yet another superscalar processor architecture

ACM SIGARCH Computer Architecture News
Comparing static and dynamic code scheduling for multiple-instruction-issue processors

MICRO 24 Proceedings of the 24th annual international symposium on Microarchitecture
Software pipelining: an evaluation of enhanced pipelining

MICRO 24 Proceedings of the 24th annual international symposium on Microarchitecture
Architecture and programming of a VLIW style programmable video signal processor

MICRO 24 Proceedings of the 24th annual international symposium on Microarchitecture
Executing loops on a fine-grained MIMD architecture

MICRO 24 Proceedings of the 24th annual international symposium on Microarchitecture
Unexpected side effects of inline substitution: a case study

ACM Letters on Programming Languages and Systems (LOPLAS)
An elementary processor architecture with simultaneous instruction issuing from multiple threads

ISCA '92 Proceedings of the 19th annual international symposium on Computer architecture
Processor coupling: integrating compile time and runtime scheduling for parallelism

ISCA '92 Proceedings of the 19th annual international symposium on Computer architecture
Abstractions for recursive pointer data structures: improving the analysis and transformation of imperative programs

PLDI '92 Proceedings of the ACM SIGPLAN 1992 conference on Programming language design and implementation
Register allocation for software pipelined loops

PLDI '92 Proceedings of the ACM SIGPLAN 1992 conference on Programming language design and implementation
Software support for speculative loads

ASPLOS V Proceedings of the fifth international conference on Architectural support for programming languages and operating systems
Design and evaluation of a compiler algorithm for prefetching

ASPLOS V Proceedings of the fifth international conference on Architectural support for programming languages and operating systems
Predicting conditional branch directions from previous runs of a program

ASPLOS V Proceedings of the fifth international conference on Architectural support for programming languages and operating systems
Sentinel scheduling for VLIW and superscalar processors

ASPLOS V Proceedings of the fifth international conference on Architectural support for programming languages and operating systems
Effective compiler support for predicated execution using the hyperblock

MICRO 25 Proceedings of the 25th annual international symposium on Microarchitecture
Enhanced region scheduling on a program dependence graph

MICRO 25 Proceedings of the 25th annual international symposium on Microarchitecture
Microarchitecture support for dynamic scheduling of acyclic task graphs

MICRO 25 Proceedings of the 25th annual international symposium on Microarchitecture
Code generation schema for modulo scheduled loops

MICRO 25 Proceedings of the 25th annual international symposium on Microarchitecture
Enhanced modulo scheduling for loops with conditional branches

MICRO 25 Proceedings of the 25th annual international symposium on Microarchitecture
A dynamic-programming technique for compacting loops

MICRO 25 Proceedings of the 25th annual international symposium on Microarchitecture
Performance evaluation of instruction scheduling on the IBM RISC System/6000

MICRO 25 Proceedings of the 25th annual international symposium on Microarchitecture
Compiler code transformations for superscalar-based high performance systems

Proceedings of the 1992 ACM/IEEE conference on Supercomputing
Abstract description of pointer data structures: an approach for improving the analysis and optimization of imperative programs

ACM Letters on Programming Languages and Systems (LOPLAS)
Performance evaluation for various configuration of superscalar processors

ACM SIGARCH Computer Architecture News
Orchestrating interactions among parallel computations

PLDI '93 Proceedings of the ACM SIGPLAN 1993 conference on Programming language design and implementation
A novel framework of register allocation for software pipelining

POPL '93 Proceedings of the 20th ACM SIGPLAN-SIGACT symposium on Principles of programming languages
Sentinel scheduling: a model for compiler-controlled speculative execution

ACM Transactions on Computer Systems (TOCS)
Rotation scheduling: a loop pipelining algorithm

DAC '93 Proceedings of the 30th international Design Automation Conference
A scalar architecture for pseudo vector processing based on slide-windowed registers

ICS '93 Proceedings of the 7th international conference on Supercomputing
Effects of memory latencies on non-blocking processor/cache architectures

ICS '93 Proceedings of the 7th international conference on Supercomputing
Exploiting the parallelism available in loops

Computer
VLIW compilation techniques in a superscalar environment

PLDI '94 Proceedings of the ACM SIGPLAN 1994 conference on Programming language design and implementation
Iterative modulo scheduling: an algorithm for software pipelining loops

MICRO 27 Proceedings of the 27th annual international symposium on Microarchitecture
Minimum register requirements for a modulo schedule

MICRO 27 Proceedings of the 27th annual international symposium on Microarchitecture
Minimizing register requirements under resource-constrained rate-optimal software pipelining

MICRO 27 Proceedings of the 27th annual international symposium on Microarchitecture
Instruction scheduling in the TOBEY compiler

IBM Journal of Research and Development
Compiler transformations for high-performance computing

ACM Computing Surveys (CSUR)
GURRR: a global unified resource requirements representation

IR '95 Papers from the 1995 ACM SIGPLAN workshop on Intermediate representations
Scheduling and mapping: software pipelining in the presence of structural hazards

PLDI '95 Proceedings of the ACM SIGPLAN 1995 conference on Programming language design and implementation
Software pipelining

ACM Computing Surveys (CSUR)
Rephasing: a transformation technique for the manipulation of timing constraints

DAC '95 Proceedings of the 32nd annual ACM/IEEE Design Automation Conference
Optimum modulo schedules for minimum register requirements

ICS '95 Proceedings of the 9th international conference on Supercomputing
The meeting graph: a new model for loop cyclic register allocation

PACT '95 Proceedings of the IFIP WG10.3 working conference on Parallel architectures and compilation techniques
Automatic generation of loop scheduling for VLIW

PACT '95 Proceedings of the IFIP WG10.3 working conference on Parallel architectures and compilation techniques
Modulo scheduling with multiple initiation intervals

Proceedings of the 28th annual international symposium on Microarchitecture
Region-based compilation: an introduction and motivation

Proceedings of the 28th annual international symposium on Microarchitecture
An experimental study of several cooperative register allocation and instruction scheduling strategies

Proceedings of the 28th annual international symposium on Microarchitecture
An effective programmable prefetch engine for on-chip caches

Proceedings of the 28th annual international symposium on Microarchitecture
Unrolling-based optimizations for modulo scheduling

Proceedings of the 28th annual international symposium on Microarchitecture
Stage scheduling: a technique to reduce the register requirements of a modulo schedule

Proceedings of the 28th annual international symposium on Microarchitecture
Hypernode reduction modulo scheduling

Proceedings of the 28th annual international symposium on Microarchitecture
Valid Transformations: A New Class of Loop Transformations for High-Level Synthesis and Pipelined Scheduling Applications

IEEE Transactions on Parallel and Distributed Systems
Software pipelining showdown: optimal vs. heuristic methods in a production compiler

PLDI '96 Proceedings of the ACM SIGPLAN 1996 conference on Programming language design and implementation
A reduced multipipeline machine description that preserves scheduling constraints

PLDI '96 Proceedings of the ACM SIGPLAN 1996 conference on Programming language design and implementation
Exploiting dual data-memory banks in digital signal processors

Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
Data prefetching and multilevel blocking for linear algebra operations

ICS '96 Proceedings of the 10th international conference on Supercomputing
Block algorithms for sparse matrix computations on high performance workstations

ICS '96 Proceedings of the 10th international conference on Supercomputing
Trace-based program analysis

POPL '96 Proceedings of the 23rd ACM SIGPLAN-SIGACT symposium on Principles of programming languages
Modulo scheduling of loops in control-intensive non-numeric programs

Proceedings of the 29th annual ACM/IEEE international symposium on Microarchitecture
Heuristics for register-constrained software pipelining

Proceedings of the 29th annual ACM/IEEE international symposium on Microarchitecture
Software pipelining loops with conditional branches

Proceedings of the 29th annual ACM/IEEE international symposium on Microarchitecture
Combining loop transformations considering caches and scheduling

Proceedings of the 29th annual ACM/IEEE international symposium on Microarchitecture
Instruction scheduling for the HP PA-8000

Proceedings of the 29th annual ACM/IEEE international symposium on Microarchitecture
Meld scheduling: relaxing scheduling constraints across region boundaries

Proceedings of the 29th annual ACM/IEEE international symposium on Microarchitecture
A Framework for Resource-Constrained Rate-Optimal Software Pipelining

IEEE Transactions on Parallel and Distributed Systems
Achieving Full Parallelism Using Multidimensional Retiming

IEEE Transactions on Parallel and Distributed Systems
Towards efficient fine-grain software pipelining

ICS '90 Proceedings of the 4th international conference on Supercomputing
Efficient scheduling of fine grain parallelism in loops

MICRO 26 Proceedings of the 26th annual international symposium on Microarchitecture
Employing finite automata for resource scheduling

MICRO 26 Proceedings of the 26th annual international symposium on Microarchitecture
A software pipelining based VLIW architecture and optimizing compiler

MICRO 23 Proceedings of the 23rd annual workshop and symposium on Microprogramming and microarchitecture
Software pipelining: a comparison and improvement

MICRO 23 Proceedings of the 23rd annual workshop and symposium on Microprogramming and microarchitecture
Using a lookahead window in a compaction-based parallelizing compiler

MICRO 23 Proceedings of the 23rd annual workshop and symposium on Microprogramming and microarchitecture
High-level microprogramming: an optimizing C compiler for a processing element of a CAD accelerator

MICRO 23 Proceedings of the 23rd annual workshop and symposium on Microprogramming and microarchitecture
The 16-fold way: a microparallel taxonomy

MICRO 26 Proceedings of the 26th annual international symposium on Microarchitecture
Techniques for extracting instruction level parallelism on MIMD architectures

MICRO 26 Proceedings of the 26th annual international symposium on Microarchitecture
A VLIW architecture based on shifting register files

MICRO 26 Proceedings of the 26th annual international symposium on Microarchitecture
Determining the Order of Processor Transactions in Statically Scheduled Multiprocessors

Journal of VLSI Signal Processing Systems
CP-PACS: a massively parallel processor for large scale scientific calculations

ICS '97 Proceedings of the 11th international conference on Supercomputing
Exploiting instruction level parallelism in processors by caching scheduled groups

Proceedings of the 24th annual international symposium on Computer architecture
Tuning compiler optimizations for simultaneous multithreading

MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Can program profiling support value prediction?

MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Cache sensitive modulo scheduling

MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Parallelizing nonnumerical code with selective scheduling and software pipelining

ACM Transactions on Programming Languages and Systems (TOPLAS)
Circuit Retiming Applied to Decomposed Software Pipelining

IEEE Transactions on Parallel and Distributed Systems
Compiler blockability of dense matrix factorizations

ACM Transactions on Mathematical Software (TOMS)
A general algorithm for tiling the register level

ICS '98 Proceedings of the 12th international conference on Supercomputing
Resource widening versus replication: limits and performance-cost trade-off

ICS '98 Proceedings of the 12th international conference on Supercomputing
The effect of instruction fetch bandwidth on value prediction

Proceedings of the 25th annual international symposium on Computer architecture
RECOD: a retiming heuristic to optimize resource and memory utilization in HW/SW codesigns

Proceedings of the 6th international workshop on Hardware/software codesign
Experiences with Cooperating Register Allocation and Instruction Scheduling

International Journal of Parallel Programming
Optimal Modulo Scheduling Through Enumeration

International Journal of Parallel Programming
Modulo Scheduling with Reduced Register Pressure

IEEE Transactions on Computers
IMPACT: an architectural framework for multiple-instruction-issue processors

25 years of the international symposia on Computer architecture (selected papers)
Reducing Data Hazards on Multi-pipelined DSP Architecture with Loop Scheduling

Journal of VLSI Signal Processing Systems - Special issue on future directions in the design and implementations of DSP systems
Analyzing Asynchronous Pipeline Schedules

International Journal of Parallel Programming
Quantitative Evaluation of Register Pressure on Software Pipelined Loops

International Journal of Parallel Programming
Using value prediction to increase the power of speculative execution hardware

ACM Transactions on Computer Systems (TOCS)
Split-path enhanced pipeline scheduling for loops with control flows

MICRO 31 Proceedings of the 31st annual ACM/IEEE international symposium on Microarchitecture
Effective cluster assignment for modulo scheduling

MICRO 31 Proceedings of the 31st annual ACM/IEEE international symposium on Microarchitecture
Widening resources: a cost-effective technique for aggressive ILP architectures

MICRO 31 Proceedings of the 31st annual ACM/IEEE international symposium on Microarchitecture
Space-time scheduling of instruction-level parallelism on a raw machine

Proceedings of the eighth international conference on Architectural support for programming languages and operating systems
Dependence based prefetching for linked data structures

Proceedings of the eighth international conference on Architectural support for programming languages and operating systems
Resource constrained dataflow retiming heuristics for VLIW ASIPs

CODES '99 Proceedings of the seventh international workshop on Hardware/software codesign
Modulo scheduling for the TMS320C6x VLIW DSP architecture

Proceedings of the ACM SIGPLAN 1999 workshop on Languages, compilers, and tools for embedded systems
Boosting beyond static scheduling in a superscalar processor

ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
Probabilistic Loop Scheduling for Applications with Uncertain Execution Time

IEEE Transactions on Computers
Unroll-based register coalescing

Proceedings of the 14th international conference on Supercomputing
Function unit specialization through code analysis

ICCAD '99 Proceedings of the 1999 IEEE/ACM international conference on Computer-aided design
Tuning Compiler Optimizations for Simultaneous Multithreading

International Journal of Parallel Programming - Special issue on the 30th annual ACM/IEEE international symposium on microarchitecture, part II
Co-Synthesis to a Hybrid RISC/FPGA Architecture

Journal of VLSI Signal Processing Systems - Special issue on VLSI on custom computing technology
Supporting Timing Analysis by Automatic Bounding of Loop Iterations

Real-Time Systems - Special issue on worst-case execution-time analysis
Matrix multiplication: a case study of enhanced data cache utilization

Journal of Experimental Algorithmics (JEA)
Properties and Algorithms for Unfolding of Probabilistic Data-Flow Graphs

Journal of VLSI Signal Processing Systems
Loop Shifting for Loop Compaction

International Journal of Parallel Programming - Special issue on instruction-level parallelism and parallelizing compilation, part 2
Properties of Rescheduling Size Invariance for Dynamic Rescheduling-Based VLIW Cross-Generation Compatibility

IEEE Transactions on Computers
Communication scheduling

ACM SIGPLAN Notices
Modulo scheduling for a fully-distributed clustered VLIW architecture

Proceedings of the 33rd annual ACM/IEEE international symposium on Microarchitecture
Two-level hierarchical register file organization for VLIW processors

Proceedings of the 33rd annual ACM/IEEE international symposium on Microarchitecture
Constraint analysis for code generation: basic techniques and applications in FACTS

ACM Transactions on Design Automation of Electronic Systems (TODAES)
Hardware/software partitioning with integrated hardware design space exploration

Proceedings of the conference on Design, automation and test in Europe
Lifetime-Sensitive Modulo Scheduling in a Production Environment

IEEE Transactions on Computers
Automated synthesis of pipelined designs on FPGAs for signal and image processing applications described in MATLAB

Proceedings of the 2001 Asia and South Pacific Design Automation Conference
Compiler-based I/O prefetching for out-of-core applications

ACM Transactions on Computer Systems (TOCS)
Communication scheduling

ASPLOS IX Proceedings of the ninth international conference on Architectural support for programming languages and operating systems
Speeding up control-dominated applications through microarchitectural customizations in embedded processors

Proceedings of the 38th annual Design Automation Conference
Power-aware modulo scheduling for high-performance VLIW processors

ISLPED '01 Proceedings of the 2001 international symposium on Low power electronics and design
Scheduling time-constrained instructions on pipelined processors

ACM Transactions on Programming Languages and Systems (TOPLAS)
Loop Transformations for Architectures with Partitioned Register Banks

OM '01 Proceedings of the 2001 ACM SIGPLAN workshop on Optimization of middleware and distributed systems
Software Pipelining Irregular Loops On the TMS320C6000 VLIW DSP Architecture

OM '01 Proceedings of the 2001 ACM SIGPLAN workshop on Optimization of middleware and distributed systems
FDRA: a software-pipelining algorithm for embedded VLIW processors

ISSS '00 Proceedings of the 13th international symposium on System synthesis
Instruction scheduling for clustered VLIW architectures

ISSS '00 Proceedings of the 13th international symposium on System synthesis
Code generation for embedded processors

ISSS '00 Proceedings of the 13th international symposium on System synthesis
ShiftQ: a buffered interconnect for custom loop accelerators

CASES '01 Proceedings of the 2001 international conference on Compilers, architecture, and synthesis for embedded systems
Compiler-Assisted Multiple Instruction Word Retry for VLIW Architectures

IEEE Transactions on Parallel and Distributed Systems
Evaluating the Use of Register Queues in Software Pipelined Loops

IEEE Transactions on Computers - Special issue on the parallel architecture and compilation techniques conference
Automatic formal verification for scheduled VLIW code

Proceedings of the joint conference on Languages, compilers and tools for embedded systems: software and compilers for embedded systems
Loop fusion for clustered VLIW architectures

Proceedings of the joint conference on Languages, compilers and tools for embedded systems: software and compilers for embedded systems
Affinity-based cluster assignment for unrolled loops

ICS '02 Proceedings of the 16th international conference on Supercomputing
Optimal software pipelining of loops with control flows

ICS '02 Proceedings of the 16th international conference on Supercomputing
An interleaved cache clustered VLIW processor

ICS '02 Proceedings of the 16th international conference on Supercomputing
Modulo schedule buffers

Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
Graph-partitioning based instruction scheduling for clustered processors

Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
Enhancing loop buffering of media and telecommunications applications using low-overhead predication

Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
Embedded software in real-time signal processing systems: design technologies

Readings in hardware/software co-design
Constraint analysis for DSP code generation

Readings in hardware/software co-design
Register tiling in nonrectangular iteration spaces

ACM Transactions on Programming Languages and Systems (TOPLAS)
Optimal code size reduction for software-pipelined and unfolded loops

Proceedings of the 15th international symposium on System Synthesis
PACT HDL: a C compiler targeting ASICs and FPGAs with power and performance optimizations

CASES '02 Proceedings of the 2002 international conference on Compilers, architecture, and synthesis for embedded systems
On achieving balanced power consumption in software pipelined loops

CASES '02 Proceedings of the 2002 international conference on Compilers, architecture, and synthesis for embedded systems
TimeC: A Time Constraint Language for ILP Processor Compilation

Constraints
Minimizing Buffer Requirements under Rate-Optimal Schedule in Regular Dataflow Networks

Journal of VLSI Signal Processing Systems
Constraint satisfaction for relative location assignment and scheduling

Proceedings of the 2001 IEEE/ACM international conference on Computer-aided design
Hardware-Software partitioning and pipelined scheduling of transformative applications

IEEE Transactions on Very Large Scale Integration (VLSI) Systems
A Simulation Study of Decoupled Vector Architectures

The Journal of Supercomputing
Enhanced Co-Scheduling: A Software Pipelining Method Using Modulo-Scheduled Pipeline Theory

International Journal of Parallel Programming
Handling Global Constraints in Compiler Strategy

International Journal of Parallel Programming
A Vectorizing Compiler for Multimedia Extensions

International Journal of Parallel Programming
Meld Scheduling: A Technique for Relaxing Scheduling Constraints

International Journal of Parallel Programming
Combining Loop Transformations Considering Caches and Scheduling

International Journal of Parallel Programming
The Intel IA-64 Compiler Code Generator

IEEE Micro
Instruction Window Size Trade-Offs and Characterization of Program Parallelism

IEEE Transactions on Computers
Three Architectural Models for Compiler-Controlled Speculative Execution

IEEE Transactions on Computers
A Performance and Cost Analysis of Applying Superscalar Method to Mainframe Computers

IEEE Transactions on Computers
Unroll-Based Copy Elimination for Enhanced Pipeline Scheduling

IEEE Transactions on Computers
A Loop Transformation Theory and an Algorithm to Maximize Parallelism

IEEE Transactions on Parallel and Distributed Systems
Making Compaction-Based Parallelization Affordable

IEEE Transactions on Parallel and Distributed Systems
Generalized Multiway Branch Unit for VLIW Microprocessors

IEEE Transactions on Parallel and Distributed Systems
Heuristic Algorithms for Scheduling Iterative Task Computations on Distributed Memory Machines

IEEE Transactions on Parallel and Distributed Systems
Hypercube Algorithms on Mesh Connected Multicomputers

IEEE Transactions on Parallel and Distributed Systems
A finite state machine based formal model of software pipelined loops with conditions

Progress in computer research
Probabilistic Rotation: Scheduling Graphs with Uncertain Execution Time

ICPP '97 Proceedings of the international Conference on Parallel Processing
Run-Time Support to Register Allocation for Loop Parallelization of Image Processing Programs

HPCN Europe 2000 Proceedings of the 8th International Conference on High-Performance Computing and Networking
Improving Code Efficiency for Reconfigurable VLIW Processors

IPDPS '02 Proceedings of the 16th International Parallel and Distributed Processing Symposium
Efficient Pipelining of Nested Loops: Unroll-and-Squash

IPDPS '02 Proceedings of the 16th International Parallel and Distributed Processing Symposium
CPR: Mixed Task and Data Parallel Scheduling for Distributed Systems

IPDPS '01 Proceedings of the 15th International Parallel & Distributed Processing Symposium
Unroll-Based Copy Elimination for Enhanced Pipeline Scheduling

LCPC '99 Proceedings of the 12th International Workshop on Languages and Compilers for Parallel Computing
Loop Shifting for Loop Compaction

LCPC '99 Proceedings of the 12th International Workshop on Languages and Compilers for Parallel Computing
Optimizing Loop Performance for Clustered VLIW Architectures

Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques
Exploiting Pseudo-Schedules to Guide Data Dependence Graph Partitioning

Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques
Influence of Variable Time Operations in Static Instruction Scheduling

Euro-Par '99 Proceedings of the 5th International Euro-Par Conference on Parallel Processing
Software pipelining: A Genetic Algorithm Approach

PACT '94 Proceedings of the IFIP WG10.3 Working Conference on Parallel Architectures and Compilation Techniques
Decomposed Software Pipelining: A New Approach to Exploit Instruction Level Parallelism for Loop Programs

PACT '93 Proceedings of the IFIP WG10.3 Working Conference on Architectures and Compilation Techniques for Fine and Medium Grain Parallelism
Software Pipelining: Petri Net Pacemaker

PACT '93 Proceedings of the IFIP WG10.3 Working Conference on Architectures and Compilation Techniques for Fine and Medium Grain Parallelism
Balancing Fine- and Medium-Grained Parallelism in Scheduling Loops for the XIMD Architecture

PACT '93 Proceedings of the IFIP WG10.3 Working Conference on Architectures and Compilation Techniques for Fine and Medium Grain Parallelism
Modeling Instruction-Level Parallelism for Software Pipelining

PACT '93 Proceedings of the IFIP WG10.3 Working Conference on Architectures and Compilation Techniques for Fine and Medium Grain Parallelism
Global Software Pipelining with Iteration Preselection

CC '00 Proceedings of the 9th International Conference on Compiler Construction
Software Pipelining of Nested Loops

CC '01 Proceedings of the 10th International Conference on Compiler Construction
A First Step Towards Time Optimal Software Pipelining of Loops with Control Flows

CC '01 Proceedings of the 10th International Conference on Compiler Construction
Reduced code size modulo scheduling in the absence of hardware support

Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture
Memory layout techniques for variables utilizing efficient DRAM access modes in embedded system design

Proceedings of the 40th annual Design Automation Conference
Predicate-aware scheduling: a technique for reducing resource constraints

Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization
Split-Path Enhanced Pipeline Scheduling

IEEE Transactions on Parallel and Distributed Systems
A compiler approach for reducing data cache energy

ICS '03 Proceedings of the 17th annual international conference on Supercomputing
A new speculation technique to optimize floating-point performance while preserving bit-by-bit reproducibility

ICS '03 Proceedings of the 17th annual international conference on Supercomputing
Hades-towards the design of an asynchronous superscalar processor

ASYNC '95 Proceedings of the 2nd Working Conference on Asynchronous Design Methodologies
Architecture Design of Reconfigurable Pipelined Datapaths

ARVLSI '99 Proceedings of the 20th Anniversary Conference on Advanced Research in VLSI
Non-Consistent Dual Register Files to Reduce Register Pressure

HPCA '95 Proceedings of the 1st IEEE Symposium on High-Performance Computer Architecture
Decoupled vector architectures

HPCA '96 Proceedings of the 2nd IEEE Symposium on High-Performance Computer Architecture
Co-Scheduling Hardware and Software Pipelines

HPCA '96 Proceedings of the 2nd IEEE Symposium on High-Performance Computer Architecture
The Architecture of Massively Parallel Processor CP-PACS

PAS '97 Proceedings of the 2nd AIZU International Symposium on Parallel Algorithms / Architecture Synthesis
The Effectiveness of Loop Unrolling for Modulo Scheduling in Clustered VLIW Architectures

ICPP '00 Proceedings of the 2000 International Conference on Parallel Processing
Efficient Scheduling of DSP Code on Processors with Distributed Register Files

Proceedings of the 12th international symposium on System synthesis
Jacobi Orderings for Multi-Port Hypercubes

IPPS '98 Proceedings of the 12th International Parallel Processing Symposium
Register-Sensitive Software Pipelining

IPPS '98 Proceedings of the 12th International Parallel Processing Symposium
An Enhanced Co-Scheduling Method using Reduced MS-State Diagrams

IPPS '98 Proceedings of the 12th International Parallel Processing Symposium
Strategies for Mapping Algorithms to Mediaprocessors for High Performance

IEEE Micro
Mapping of generalized template matching onto reconfigurable computers

IEEE Transactions on Very Large Scale Integration (VLSI) Systems - Special section on the 2001 international conference on computer design (ICCD)
Automatic exploration of VLIW processor architectures from a designer's experience based specification

CODES '94 Proceedings of the 3rd international workshop on Hardware/software co-design
Code size reduction technique and implementation for software-pipelined DSP applications

ACM Transactions on Embedded Computing Systems (TECS)
Automatic generation of application specific processors

Proceedings of the 2003 international conference on Compilers, architecture and synthesis for embedded systems
A timed Petri-net model for fine-grain loop scheduling

CASCON '91 Proceedings of the 1991 conference of the Centre for Advanced Studies on Collaborative research
Register allocation for optimal loop scheduling

CASCON '93 Proceedings of the 1993 conference of the Centre for Advanced Studies on Collaborative research: distributed computing - Volume 2
Loop Shifting and Compaction for the High-Level Synthesis of Designs with Complex Control Flow

Proceedings of the conference on Design, automation and test in Europe - Volume 1
Analysis and Modeling of Energy Reducing Source Code Transformations

Proceedings of the conference on Design, automation and test in Europe - Volume 3
Instruction Scheduling for Low Power

Journal of VLSI Signal Processing Systems
An experimental evaluation of scalar replacement on scientific benchmarks

Software—Practice & Experience
Application-domain-driven system design for pervasive video processing

Ambient intelligence
Register Constrained Modulo Scheduling

IEEE Transactions on Parallel and Distributed Systems
Code Generation for Single-Dimension Software Pipelining of Multi-Dimensional Loops

Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization
Single-Dimension Software Pipelining for Multi-Dimensional Loops

Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization
Probabilistic Predicate-Aware Modulo Scheduling

Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization
SPOT: development tool for software pipeline optimization for VLIW-DSPs used in real-time image processing

Real-Time Imaging - Special issue on software engineering
The design of dynamically reconfigurable datapath coprocessors

ACM Transactions on Embedded Computing Systems (TECS)
Field-testing IMPACT EPIC research results in Itanium 2

Proceedings of the 31st annual international symposium on Computer architecture
Time optimal software pipelining of loops with control flows

International Journal of Parallel Programming
Optimistic register coalescing

ACM Transactions on Programming Languages and Systems (TOPLAS)
Exploiting Loop-Level Parallelism on Coarse-Grained Reconfigurable Architectures Using Modulo Scheduling

DATE '03 Proceedings of the conference on Design, Automation and Test in Europe - Volume 1
Fast and Accurate Multiprocessor Architecture Exploration with Symbolic Programs

DATE '03 Proceedings of the conference on Design, Automation and Test in Europe - Volume 1
Multithreaded Synchronous Data Flow Simulation

DATE '03 Proceedings of the conference on Design, Automation and Test in Europe - Volume 1
Static Placement, Dynamic Issue (SPDI) Scheduling for EDGE Architectures

Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques
Instruction level parallelism of non-uniform acyclic loops

Journal of Computing Sciences in Colleges
Combining Extended Retiming and Unfolding for Rate-Optimal Graph Transformation

Journal of VLSI Signal Processing Systems
Register allocation for software pipelined multi-dimensional loops

Proceedings of the 2005 ACM SIGPLAN conference on Programming language design and implementation
Automatically partitioning packet processing applications for pipelined architectures

Proceedings of the 2005 ACM SIGPLAN conference on Programming language design and implementation
Complementing software pipelining with software thread integration

LCTES '05 Proceedings of the 2005 ACM SIGPLAN/SIGBED conference on Languages, compilers, and tools for embedded systems
Automatic multithreading and multiprocessing of C programs for IXP

Proceedings of the tenth ACM SIGPLAN symposium on Principles and practice of parallel programming
A reprogrammable customization framework for efficient branch resolution in embedded processors

ACM Transactions on Embedded Computing Systems (TECS)
Future wireless convergence platforms

CODES+ISSS '05 Proceedings of the 3rd IEEE/ACM/IFIP international conference on Hardware/software codesign and system synthesis
Cutpoints for formal equivalence verification of embedded software

Proceedings of the 5th ACM international conference on Embedded software
Reducing data cache leakage energy using a compiler-based approach

ACM Transactions on Embedded Computing Systems (TECS)
Deep Jam: Conversion of Coarse-Grain Parallelism to Instruction-Level and Vector Parallelism for Irregular Applications

Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
Using a lookahead window in a compaction-based parallelizing compiler

ACM SIGMICRO Newsletter
Automatic Thread Extraction with Decoupled Software Pipelining

Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture
Exploiting Vector Parallelism in Software Pipelined Loops

Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture
Compiler-directed high-level energy estimation and optimization

ACM Transactions on Embedded Computing Systems (TECS)
Software and hardware techniques to optimize register file utilization in VLIW architectures

International Journal of Parallel Programming
A new register file access architecture for software pipelining in VLIW processors

Proceedings of the 2005 Asia and South Pacific Design Automation Conference
Compiler transformations for effectively exploiting a zero overhead loop buffer

Software—Practice & Experience
Automatic instruction scheduler retargeting by reverse-engineering

Proceedings of the 2006 ACM SIGPLAN conference on Programming language design and implementation
Generic software pipelining at the assembly level

SCOPES '05 Proceedings of the 2005 workshop on Software and compilers for embedded systems
Compiling for stream processing

Proceedings of the 15th international conference on Parallel architectures and compilation techniques
Reaching fast code faster: using modeling for efficient software thread integration on a VLIW DSP

CASES '06 Proceedings of the 2006 international conference on Compilers, architecture and synthesis for embedded systems
Embedded software verification using symbolic execution and uninterpreted functions

International Journal of Parallel Programming
Merging Head and Tail Duplication for Convergent Hyperblock Formation

Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Single-dimension software pipelining for multidimensional loops

ACM Transactions on Architecture and Code Optimization (TACO)
Scheduling of Iterative Algorithms with Matrix Operations for Efficient FPGA Design--Implementation of Finite Interval Constant Modulus Algorithm

Journal of VLSI Signal Processing Systems
FEADS: a framework for exploring the application design space on network processors

International Journal of Parallel Programming
An Analytical Approach to Scheduling Code for Superscalar and VLIW Architectures

ICPP '94 Proceedings of the 1994 International Conference on Parallel Processing - Volume 01
Register pointer architecture for efficient embedded processors

Proceedings of the conference on Design, automation and test in Europe
Executing irregular scientific applications on stream architectures

Proceedings of the 21st annual international conference on Supercomputing
MPSoC memory optimization using program transformation

ACM Transactions on Design Automation of Electronic Systems (TODAES)
Software optimization of video codecs on pentium processor with MMX technology

EURASIP Journal on Applied Signal Processing
Pfelib: a performance primitives library for embedded vision

EURASIP Journal on Embedded Systems
Facilitating compiler optimizations through the dynamic mapping of alternate register structures

CASES '07 Proceedings of the 2007 international conference on Compilers, architecture, and synthesis for embedded systems
Latency-tolerant software pipelining in a production compiler

Proceedings of the 6th annual IEEE/ACM international symposium on Code generation and optimization
Algorithms and analysis of scheduling for loops with minimum switching

International Journal of Computational Science and Engineering
A new strategy for multiprocessor scheduling of cyclic task graphs

International Journal of High Performance Computing and Networking
Dynamic configuration of application-specific implicit instructions for embedded pipelined processors

Proceedings of the 2008 ACM symposium on Applied computing
Optimized mapping for enhancing the operation parallelism in coarse-grained reconfigurable arrays

SMO'06 Proceedings of the 6th WSEAS International Conference on Simulation, Modelling and Optimization
Rotating register allocation with multiple rotating branches

Proceedings of the 22nd annual international conference on Supercomputing
Post-pass periodic register allocation to minimise loop unrolling degree

Proceedings of the 2008 ACM SIGPLAN-SIGBED conference on Languages, compilers, and tools for embedded systems
Placement-and-routing-based register allocation for coarse-grained reconfigurable arrays

Proceedings of the 2008 ACM SIGPLAN-SIGBED conference on Languages, compilers, and tools for embedded systems
Prefabrication and postfabrication architecture exploration for partially reconfigurable VLIW processors

ACM Transactions on Embedded Computing Systems (TECS)
Register allocation for software pipelined multidimensional loops

ACM Transactions on Programming Languages and Systems (TOPLAS)
VEAL: Virtualized Execution Accelerator for Loops

ISCA '08 Proceedings of the 35th Annual International Symposium on Computer Architecture
Automatic architecture refinement techniques for customizing processing elements

Proceedings of the 45th annual Design Automation Conference
Automated dynamic throughput-constrained structural-level pipelining in streaming applications

Proceedings of the conference on Design, automation and test in Europe
Validating High-Level Synthesis

CAV '08 Proceedings of the 20th international conference on Computer Aided Verification
Stream Scheduling: A Framework to Manage Bulk Operations in Memory Hierarchies

Euro-Par '08 Proceedings of the 14th international Euro-Par conference on Parallel Processing
Timing optimization via nest-loop pipelining considering code size

Microprocessors & Microsystems
Integrated Modulo Scheduling for Clustered VLIW Architectures

HiPEAC '09 Proceedings of the 4th International Conference on High Performance Embedded Architectures and Compilers
Software Pipelining in Nested Loops with Prolog-Epilog Merging

HiPEAC '09 Proceedings of the 4th International Conference on High Performance Embedded Architectures and Compilers
Synthesis of reconfigurable high-performance multicore systems

Proceedings of the ACM/SIGDA international symposium on Field programmable gate arrays
Resource aware mapping on coarse grained reconfigurable arrays

Microprocessors & Microsystems
Design and implementation of a queue compiler

Microprocessors & Microsystems
Periodic register saturation in innermost loops

Parallel Computing
Improving performance of simple cores by exploiting loop-level parallelism through value prediction and reconfiguration

Proceedings of the 6th ACM conference on Computing frontiers
Compiler assisted architectural exploration framework for coarse grained reconfigurable arrays

The Journal of Supercomputing
Modulo scheduling without overlapped lifetimes

Proceedings of the 2009 ACM SIGPLAN/SIGBED conference on Languages, compilers, and tools for embedded systems
Mapping of nomadic multimedia applications on the ADRES reconfigurable array processor

Microprocessors & Microsystems
Design and Tool Flow of Multimedia MPSoC Platforms

Journal of Signal Processing Systems
Energy-Aware Loop Scheduling and Assignment for Multi-Core, Multi-Functional-Unit Architecture

Journal of Signal Processing Systems
A simple, verified validator for software pipelining

Proceedings of the 37th annual ACM SIGPLAN-SIGACT symposium on Principles of programming languages
Preprocessing strategy for effective modulo scheduling on multi-issue digital signal processors

CC'07 Proceedings of the 16th international conference on Compiler construction
Register allocation and optimal spill code scheduling in software pipelined loops using 0-1 integer linear programming formulation

CC'07 Proceedings of the 16th international conference on Compiler construction
Integrating high-level optimizations in a production compiler: design and implementation experience

CC'03 Proceedings of the 12th international conference on Compiler construction
MIRS: modulo scheduling with integrated register spilling

LCPC'01 Proceedings of the 14th international conference on Languages and compilers for parallel computing
Minimizing communication in rate-optimal software pipelining for stream programs

Proceedings of the 8th annual IEEE/ACM international symposium on Code generation and optimization
Towards a source level compiler: source level modulo scheduling

Program analysis and compilation, theory and practice
Buffer-space efficient and deadlock-free scheduling of stream applications on multi-core architectures

Proceedings of the twenty-second annual ACM symposium on Parallelism in algorithms and architectures
Translation validation of high-level synthesis

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
Reducing Memory Constraints in Modulo Scheduling Synthesis for FPGAs

ACM Transactions on Reconfigurable Technology and Systems (TRETS)
An Efficient Memory Organization for High-ILP Inner Modem Baseband SDR Processors

Journal of Signal Processing Systems
Fine-grain dynamic instruction placement for L0 scratch-pad memory

CASES '10 Proceedings of the 2010 international conference on Compilers, architectures and synthesis for embedded systems
Automatic memory partitioning: increasing memory parallelism via data structure partitioning

CODES/ISSS '10 Proceedings of the eighth IEEE/ACM/IFIP international conference on Hardware/software codesign and system synthesis
Translation validation of loop optimizations and software pipelining in the TVOC framework: in memory of Amir Pnueli

SAS'10 Proceedings of the 17th international conference on Static analysis
Hierarchical multithreading: programming model and system software

IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
Design flow for optimizing performance in processor systems with on-chip coarse-grain reconfigurable logic

IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
Exploring the design space of an optimized compiler approach for mesh-like coarse-grained reconfigurable architectures

IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
How many threads to spawn during program multithreading?

LCPC'10 Proceedings of the 23rd international conference on Languages and compilers for parallel computing
Using a "codelet" program execution model for exascale machines: position paper

Proceedings of the 1st International Workshop on Adaptive Self-Tuning Computing Systems for the Exaflop Era
Precedence constraint posting for cyclic scheduling problems

CPAIOR'11 Proceedings of the 8th international conference on Integration of AI and OR techniques in constraint programming for combinatorial optimization problems
Natural instruction level parallelism-aware compiler for high-performance QueueCore processor architecture

The Journal of Supercomputing
Worst case analysis of decomposed software pipelining for cyclic unitary RCPSP with precedence delays

Journal of Scheduling
Improving performance through deep value profiling and specialization with code transformation

Computer Languages, Systems and Structures
Efficient Spilling Reduction for Software Pipelined Loops in Presence of Multiple Register Types in Embedded VLIW Processors

ACM Transactions on Embedded Computing Systems (TECS)
Register pressure in software-pipelined loop nests: fast computation and impact on architecture design

LCPC'05 Proceedings of the 18th international conference on Languages and Compilers for Parallel Computing
Combined ILP and register tiling: analytical model and optimization framework

LCPC'05 Proceedings of the 18th international conference on Languages and Compilers for Parallel Computing
Exploring the limits of GPGPU scheduling in control flow bound applications

ACM Transactions on Architecture and Code Optimization (TACO) - HIPEAC Papers
Instruction re-selection for iterative modulo scheduling on high performance multi-issue DSPs

EUC'06 Proceedings of the 2006 international conference on Emerging Directions in Embedded and Ubiquitous Computing
SCAN: a heuristic for near-optimal software pipelining

Euro-Par'06 Proceedings of the 12th international conference on Parallel Processing
Multi-dimensional kernel generation for loop nest software pipelining

Euro-Par'06 Proceedings of the 12th international conference on Parallel Processing
Task partitioning for multi-core network processors

CC'05 Proceedings of the 14th international conference on Compiler Construction
Trimaran: an infrastructure for research in instruction-level parallelism

LCPC'04 Proceedings of the 17th international conference on Languages and Compilers for High Performance Computing
Using the meeting graph framework to minimise kernel loop unrolling for scheduled loops

LCPC'09 Proceedings of the 22nd international conference on Languages and Compilers for Parallel Computing
Single thread program parallelism with dataflow abstracting thread

ICA3PP'10 Proceedings of the 10th international conference on Algorithms and Architectures for Parallel Processing - Volume Part II
Integrated Code Generation for Loops

ACM Transactions on Embedded Computing Systems (TECS)
Scheduling expression DAGs for minimal register need

Computer Languages
Deadline constrained cyclic scheduling on pipelined dedicated processors considering multiprocessor tasks and changeover times

Mathematical and Computer Modelling: An International Journal
Automatic generation of software pipelines for heterogeneous parallel systems

SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Cache-sensitive MapReduce DGEMM algorithms for shared memory architectures

Proceedings of the South African Institute for Computer Scientists and Information Technologists Conference
Optimal and heuristic global code motion for minimal spilling

CC'13 Proceedings of the 22nd international conference on Compiler Construction
Near-Optimal Microprocessor and Accelerators Codesign with Latency and Throughput Constraints

ACM Transactions on Architecture and Code Optimization (TACO)
On-the-fly pipeline parallelism

Proceedings of the twenty-fifth annual ACM symposium on Parallelism in algorithms and architectures
Software thread integration for instruction-level parallelism

ACM Transactions on Embedded Computing Systems (TECS)
A catalog of stream processing optimizations

ACM Computing Surveys (CSUR)
The benefits of using variable-length pipelined operations in high-level synthesis

ACM Transactions on Embedded Computing Systems (TECS)
Allocating rotating registers by scheduling

Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture
Just-In-Time Software Pipelining

Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization
SDC-based modulo scheduling for pipeline synthesis

Proceedings of the International Conference on Computer-Aided Design
Predicate-aware, makespan-preserving software pipelining of scheduling tables

ACM Transactions on Architecture and Code Optimization (TACO)

Quantified Score

Hi-index	0.02

Abstract

This paper shows that software pipelining is an effective and viable scheduling technique for VLIW processors. In software pipelining, iterations of a loop in the source program are continuously initiated at constant intervals, before the preceding iterations complete. The advantage of software pipelining is that optimal performance can be achieved with compact object code.

This paper extends previous results of software pipelining in two ways. First, it shows that by using an improved algorithm, near-optimal performance can be obtained without specialized hardware. Second, we propose a hierarchical reduction scheme whereby entire control constructs are reduced to an object similar to an operation in a basic block. With this scheme, all innermost loops, including those containing conditional statements, can be software pipelined. It also diminishes the start-up cost of loops with a small number of iterations. Hierarchical reduction complements the software pipelining technique, permitting a consistent performance improvement to be obtained.

The techniques proposed have been validated by an implementation of a compiler for Warp, a systolic array consisting of 10 VLIW processors. This compiler has been used for developing a large number of applications in the areas of image, signal and scientific processing.