Compilation for a high-performance systolic array
SIGPLAN '86 Proceedings of the 1986 SIGPLAN symposium on Compiler construction
Highly concurrent scalar processing
Highly concurrent scalar processing
URPR—An extension of URCR for software pipelining
MICRO 19 Proceedings of the 19th annual workshop on Microprogramming
A study of scalar compilation techniques for pipelined supercomputers
ASPLOS II Proceedings of the second international conference on Architectural support for programming languages and operating systems
A VLIW architecture for a trace scheduling compiler
ASPLOS II Proceedings of the second international conference on Architectural support for programming languages and operating systems
The Warp computer: Architecture, implementation, and performance
IEEE Transactions on Computers
Compiler optimizations for asynchronous systolic array programs
POPL '88 Proceedings of the 15th ACM SIGPLAN-SIGACT symposium on Principles of programming languages
A compilation technique for software pipelining of loops with conditional jumps
MICRO 20 Proceedings of the 20th annual workshop on Microprogramming
GURPR—a method for global software pipelining
MICRO 20 Proceedings of the 20th annual workshop on Microprogramming
Communications of the ACM
Parallel processing: a smart compiler and a dumb machine
SIGPLAN '84 Proceedings of the 1984 SIGPLAN symposium on Compiler construction
A Fortran compiler for the FPS-164 scientific computer
SIGPLAN '84 Proceedings of the 1984 SIGPLAN symposium on Compiler construction
Dependence graphs and compiler optimizations
POPL '81 Proceedings of the 8th ACM SIGPLAN-SIGACT symposium on Principles of programming languages
Computers and Intractability: A Guide to the Theory of NP-Completeness
MICRO 14 Proceedings of the 14th annual workshop on Microprogramming
Global optimization of microprograms through modular control constructs
MICRO 12 Proceedings of the 12th annual workshop on Microprogramming
Improving the throughput of a pipeline by insertion of delays
ISCA '76 Proceedings of the 3rd annual symposium on Computer architecture
An improvement of trace scheduling for global microcode compaction
MICRO 17 Proceedings of the 17th annual workshop on Microprogramming
The optimization of horizontal microcode within and beyond basic blocks: an application of processor scheduling with resources
Bulldog: a compiler for VLIW architectures
Warp: an integrated solution of high-speed parallel computing
Proceedings of the 1988 ACM/IEEE conference on Supercomputing
Architecture and compiler tradeoffs for a long instruction word processor
ASPLOS III Proceedings of the third international conference on Architectural support for programming languages and operating systems
Available instruction-level parallelism for superscalar and superpipelined machines
ASPLOS III Proceedings of the third international conference on Architectural support for programming languages and operating systems
ISCA '89 Proceedings of the 16th annual international symposium on Computer architecture
ISCA '89 Proceedings of the 16th annual international symposium on Computer architecture
Cost-effective design of application specific VLIW processors using the SCARCE framework
MICRO 22 Proceedings of the 22nd annual workshop on Microprogramming and microarchitecture
On optimal loop parallelization
MICRO 22 Proceedings of the 22nd annual workshop on Microprogramming and microarchitecture
A study of scalar compilation techniques for pipelined supercomputers
ACM Transactions on Mathematical Software (TOMS)
Automatic transformation of series expressions into loops
ACM Transactions on Programming Languages and Systems (TOPLAS)
The floating point performance of a superscalar SPARC processor
ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
Limits of instruction-level parallelism
ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
Mapping concurrent programs to VLIW processors
PPOPP '91 Proceedings of the third ACM SIGPLAN symposium on Principles and practice of parallel programming
Parallelization of loops with exits on pipelined architectures
Proceedings of the 1990 ACM/IEEE conference on Supercomputing
A timed Petri-net model for fine-grain loop scheduling
PLDI '91 Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation
Circular scheduling: a new technique to perform software pipelining
PLDI '91 Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation
Global instruction scheduling for superscalar machines
PLDI '91 Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation
OHMEGA: a VLSI superscalar processor architecture for numerical applications
ISCA '91 Proceedings of the 18th annual international symposium on Computer architecture
IMPACT: an architectural framework for multiple-instruction-issue processors
ISCA '91 Proceedings of the 18th annual international symposium on Computer architecture
ACM SIGARCH Computer Architecture News
Comparing static and dynamic code scheduling for multiple-instruction-issue processors
MICRO 24 Proceedings of the 24th annual international symposium on Microarchitecture
Software pipelining: an evaluation of enhanced pipelining
MICRO 24 Proceedings of the 24th annual international symposium on Microarchitecture
Architecture and programming of a VLIW style programmable video signal processor
MICRO 24 Proceedings of the 24th annual international symposium on Microarchitecture
Executing loops on a fine-grained MIMD architecture
MICRO 24 Proceedings of the 24th annual international symposium on Microarchitecture
Unexpected side effects of inline substitution: a case study
ACM Letters on Programming Languages and Systems (LOPLAS)
An elementary processor architecture with simultaneous instruction issuing from multiple threads
ISCA '92 Proceedings of the 19th annual international symposium on Computer architecture
Processor coupling: integrating compile time and runtime scheduling for parallelism
ISCA '92 Proceedings of the 19th annual international symposium on Computer architecture
PLDI '92 Proceedings of the ACM SIGPLAN 1992 conference on Programming language design and implementation
Register allocation for software pipelined loops
PLDI '92 Proceedings of the ACM SIGPLAN 1992 conference on Programming language design and implementation
Software support for speculative loads
ASPLOS V Proceedings of the fifth international conference on Architectural support for programming languages and operating systems
Design and evaluation of a compiler algorithm for prefetching
ASPLOS V Proceedings of the fifth international conference on Architectural support for programming languages and operating systems
Predicting conditional branch directions from previous runs of a program
ASPLOS V Proceedings of the fifth international conference on Architectural support for programming languages and operating systems
Sentinel scheduling for VLIW and superscalar processors
ASPLOS V Proceedings of the fifth international conference on Architectural support for programming languages and operating systems
Effective compiler support for predicated execution using the hyperblock
MICRO 25 Proceedings of the 25th annual international symposium on Microarchitecture
Enhanced region scheduling on a program dependence graph
MICRO 25 Proceedings of the 25th annual international symposium on Microarchitecture
Microarchitecture support for dynamic scheduling of acyclic task graphs
MICRO 25 Proceedings of the 25th annual international symposium on Microarchitecture
Code generation schema for modulo scheduled loops
MICRO 25 Proceedings of the 25th annual international symposium on Microarchitecture
Enhanced modulo scheduling for loops with conditional branches
MICRO 25 Proceedings of the 25th annual international symposium on Microarchitecture
A dynamic-programming technique for compacting loops
MICRO 25 Proceedings of the 25th annual international symposium on Microarchitecture
Performance evaluation of instruction scheduling on the IBM RISC System/6000
MICRO 25 Proceedings of the 25th annual international symposium on Microarchitecture
Compiler code transformations for superscalar-based high performance systems
Proceedings of the 1992 ACM/IEEE conference on Supercomputing
ACM Letters on Programming Languages and Systems (LOPLAS)
Performance evaluation for various configuration of superscalar processors
ACM SIGARCH Computer Architecture News
Orchestrating interactions among parallel computations
PLDI '93 Proceedings of the ACM SIGPLAN 1993 conference on Programming language design and implementation
A novel framework of register allocation for software pipelining
POPL '93 Proceedings of the 20th ACM SIGPLAN-SIGACT symposium on Principles of programming languages
Sentinel scheduling: a model for compiler-controlled speculative execution
ACM Transactions on Computer Systems (TOCS)
Rotation scheduling: a loop pipelining algorithm
DAC '93 Proceedings of the 30th international Design Automation Conference
A scalar architecture for pseudo vector processing based on slide-windowed registers
ICS '93 Proceedings of the 7th international conference on Supercomputing
Effects of memory latencies on non-blocking processor/cache architectures
ICS '93 Proceedings of the 7th international conference on Supercomputing
VLIW compilation techniques in a superscalar environment
PLDI '94 Proceedings of the ACM SIGPLAN 1994 conference on Programming language design and implementation
Iterative modulo scheduling: an algorithm for software pipelining loops
MICRO 27 Proceedings of the 27th annual international symposium on Microarchitecture
Minimum register requirements for a modulo schedule
MICRO 27 Proceedings of the 27th annual international symposium on Microarchitecture
Minimizing register requirements under resource-constrained rate-optimal software pipelining
MICRO 27 Proceedings of the 27th annual international symposium on Microarchitecture
Instruction scheduling in the TOBEY compiler
IBM Journal of Research and Development
Compiler transformations for high-performance computing
ACM Computing Surveys (CSUR)
GURRR: a global unified resource requirements representation
IR '95 Papers from the 1995 ACM SIGPLAN workshop on Intermediate representations
Scheduling and mapping: software pipelining in the presence of structural hazards
PLDI '95 Proceedings of the ACM SIGPLAN 1995 conference on Programming language design and implementation
ACM Computing Surveys (CSUR)
Rephasing: a transformation technique for the manipulation of timing constraints
DAC '95 Proceedings of the 32nd annual ACM/IEEE Design Automation Conference
Optimum modulo schedules for minimum register requirements
ICS '95 Proceedings of the 9th international conference on Supercomputing
The meeting graph: a new model for loop cyclic register allocation
PACT '95 Proceedings of the IFIP WG10.3 working conference on Parallel architectures and compilation techniques
Automatic generation of loop scheduling for VLIW
PACT '95 Proceedings of the IFIP WG10.3 working conference on Parallel architectures and compilation techniques
Modulo scheduling with multiple initiation intervals
Proceedings of the 28th annual international symposium on Microarchitecture
Region-based compilation: an introduction and motivation
Proceedings of the 28th annual international symposium on Microarchitecture
Proceedings of the 28th annual international symposium on Microarchitecture
An effective programmable prefetch engine for on-chip caches
Proceedings of the 28th annual international symposium on Microarchitecture
Unrolling-based optimizations for modulo scheduling
Proceedings of the 28th annual international symposium on Microarchitecture
Stage scheduling: a technique to reduce the register requirements of a modulo schedule
Proceedings of the 28th annual international symposium on Microarchitecture
Hypernode reduction modulo scheduling
Proceedings of the 28th annual international symposium on Microarchitecture
IEEE Transactions on Parallel and Distributed Systems
Software pipelining showdown: optimal vs. heuristic methods in a production compiler
PLDI '96 Proceedings of the ACM SIGPLAN 1996 conference on Programming language design and implementation
A reduced multipipeline machine description that preserves scheduling constraints
PLDI '96 Proceedings of the ACM SIGPLAN 1996 conference on Programming language design and implementation
Exploiting dual data-memory banks in digital signal processors
Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
Data prefetching and multilevel blocking for linear algebra operations
ICS '96 Proceedings of the 10th international conference on Supercomputing
Block algorithms for sparse matrix computations on high performance workstations
ICS '96 Proceedings of the 10th international conference on Supercomputing
POPL '96 Proceedings of the 23rd ACM SIGPLAN-SIGACT symposium on Principles of programming languages
Modulo scheduling of loops in control-intensive non-numeric programs
Proceedings of the 29th annual ACM/IEEE international symposium on Microarchitecture
Heuristics for register-constrained software pipelining
Proceedings of the 29th annual ACM/IEEE international symposium on Microarchitecture
Software pipelining loops with conditional branches
Proceedings of the 29th annual ACM/IEEE international symposium on Microarchitecture
Combining loop transformations considering caches and scheduling
Proceedings of the 29th annual ACM/IEEE international symposium on Microarchitecture
Instruction scheduling for the HP PA-8000
Proceedings of the 29th annual ACM/IEEE international symposium on Microarchitecture
Meld scheduling: relaxing scheduling constraints across region boundaries
Proceedings of the 29th annual ACM/IEEE international symposium on Microarchitecture
A Framework for Resource-Constrained Rate-Optimal Software Pipelining
IEEE Transactions on Parallel and Distributed Systems
Achieving Full Parallelism Using Multidimensional Retiming
IEEE Transactions on Parallel and Distributed Systems
Towards efficient fine-grain software pipelining
ICS '90 Proceedings of the 4th international conference on Supercomputing
Efficient scheduling of fine grain parallelism in loops
MICRO 26 Proceedings of the 26th annual international symposium on Microarchitecture
Employing finite automata for resource scheduling
MICRO 26 Proceedings of the 26th annual international symposium on Microarchitecture
A software pipelining based VLIW architecture and optimizing compiler
MICRO 23 Proceedings of the 23rd annual workshop and symposium on Microprogramming and microarchitecture
Software pipelining: a comparison and improvement
MICRO 23 Proceedings of the 23rd annual workshop and symposium on Microprogramming and microarchitecture
Using a lookahead window in a compaction-based parallelizing compiler
MICRO 23 Proceedings of the 23rd annual workshop and symposium on Microprogramming and microarchitecture
High-level microprogramming: an optimizing C compiler for a processing element of a CAD accelerator
MICRO 23 Proceedings of the 23rd annual workshop and symposium on Microprogramming and microarchitecture
The 16-fold way: a microparallel taxonomy
MICRO 26 Proceedings of the 26th annual international symposium on Microarchitecture
Techniques for extracting instruction level parallelism on MIMD architectures
MICRO 26 Proceedings of the 26th annual international symposium on Microarchitecture
A VLIW architecture based on shifting register files
MICRO 26 Proceedings of the 26th annual international symposium on Microarchitecture
Determining the Order of Processor Transactions in Statically Scheduled Multiprocessors
Journal of VLSI Signal Processing Systems
CP-PACS: a massively parallel processor for large scale scientific calculations
ICS '97 Proceedings of the 11th international conference on Supercomputing
Exploiting instruction level parallelism in processors by caching scheduled groups
Proceedings of the 24th annual international symposium on Computer architecture
Tuning compiler optimizations for simultaneous multithreading
MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Can program profiling support value prediction?
MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Cache sensitive modulo scheduling
MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Parallelizing nonnumerical code with selective scheduling and software pipelining
ACM Transactions on Programming Languages and Systems (TOPLAS)
Circuit Retiming Applied to Decomposed Software Pipelining
IEEE Transactions on Parallel and Distributed Systems
Compiler blockability of dense matrix factorizations
ACM Transactions on Mathematical Software (TOMS)
A general algorithm for tiling the register level
ICS '98 Proceedings of the 12th international conference on Supercomputing
Resource widening versus replication: limits and performance-cost trade-off
ICS '98 Proceedings of the 12th international conference on Supercomputing
The effect of instruction fetch bandwidth on value prediction
Proceedings of the 25th annual international symposium on Computer architecture
RECOD: a retiming heuristic to optimize resource and memory utilization in HW/SW codesigns
Proceedings of the 6th international workshop on Hardware/software codesign
Experiences with Cooperating Register Allocation and Instruction Scheduling
International Journal of Parallel Programming
Optimal Modulo Scheduling Through Enumeration
International Journal of Parallel Programming
Modulo Scheduling with Reduced Register Pressure
IEEE Transactions on Computers
IMPACT: an architectural framework for multiple-instruction-issue processors
25 years of the international symposia on Computer architecture (selected papers)
Reducing Data Hazards on Multi-pipelined DSP Architecture with Loop Scheduling
Journal of VLSI Signal Processing Systems - Special issue on future directions in the design and implementations of DSP systems
Analyzing Asynchronous Pipeline Schedules
International Journal of Parallel Programming
Quantitative Evaluation of Register Pressure on Software Pipelined Loops
International Journal of Parallel Programming
Using value prediction to increase the power of speculative execution hardware
ACM Transactions on Computer Systems (TOCS)
Split-path enhanced pipeline scheduling for loops with control flows
MICRO 31 Proceedings of the 31st annual ACM/IEEE international symposium on Microarchitecture
Effective cluster assignment for modulo scheduling
MICRO 31 Proceedings of the 31st annual ACM/IEEE international symposium on Microarchitecture
Widening resources: a cost-effective technique for aggressive ILP architectures
MICRO 31 Proceedings of the 31st annual ACM/IEEE international symposium on Microarchitecture
Space-time scheduling of instruction-level parallelism on a Raw machine
Proceedings of the eighth international conference on Architectural support for programming languages and operating systems
Dependence based prefetching for linked data structures
Proceedings of the eighth international conference on Architectural support for programming languages and operating systems
Resource constrained dataflow retiming heuristics for VLIW ASIPs
CODES '99 Proceedings of the seventh international workshop on Hardware/software codesign
Modulo scheduling for the TMS320C6x VLIW DSP architecture
Proceedings of the ACM SIGPLAN 1999 workshop on Languages, compilers, and tools for embedded systems
Boosting beyond static scheduling in a superscalar processor
ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
Probabilistic Loop Scheduling for Applications with Uncertain Execution Time
IEEE Transactions on Computers
Unroll-based register coalescing
Proceedings of the 14th international conference on Supercomputing
Function unit specialization through code analysis
ICCAD '99 Proceedings of the 1999 IEEE/ACM international conference on Computer-aided design
Tuning Compiler Optimizations for Simultaneous Multithreading
International Journal of Parallel Programming - Special issue on the 30th annual ACM/IEEE international symposium on microarchitecture, part II
Co-Synthesis to a Hybrid RISC/FPGA Architecture
Journal of VLSI Signal Processing Systems - Special issue on VLSI on custom computing technology
Supporting Timing Analysis by Automatic Bounding of Loop Iterations
Real-Time Systems - Special issue on worst-case execution-time analysis
Matrix multiplication: a case study of enhanced data cache utilization
Journal of Experimental Algorithmics (JEA)
Properties and Algorithms for Unfolding of Probabilistic Data-Flow Graphs
Journal of VLSI Signal Processing Systems
Loop Shifting for Loop Compaction
International Journal of Parallel Programming - Special issue on instruction-level parallelism and parallelizing compilation, part 2
IEEE Transactions on Computers
ACM SIGPLAN Notices
Modulo scheduling for a fully-distributed clustered VLIW architecture
Proceedings of the 33rd annual ACM/IEEE international symposium on Microarchitecture
Two-level hierarchical register file organization for VLIW processors
Proceedings of the 33rd annual ACM/IEEE international symposium on Microarchitecture
Constraint analysis for code generation: basic techniques and applications in FACTS
ACM Transactions on Design Automation of Electronic Systems (TODAES)
Hardware/software partitioning with integrated hardware design space exploration
Proceedings of the conference on Design, automation and test in Europe
Lifetime-Sensitive Modulo Scheduling in a Production Environment
IEEE Transactions on Computers
Proceedings of the 2001 Asia and South Pacific Design Automation Conference
Compiler-based I/O prefetching for out-of-core applications
ACM Transactions on Computer Systems (TOCS)
ASPLOS IX Proceedings of the ninth international conference on Architectural support for programming languages and operating systems
Proceedings of the 38th annual Design Automation Conference
Power-aware modulo scheduling for high-performance VLIW processors
ISLPED '01 Proceedings of the 2001 international symposium on Low power electronics and design
Scheduling time-constrained instructions on pipelined processors
ACM Transactions on Programming Languages and Systems (TOPLAS)
Loop Transformations for Architectures with Partitioned Register Banks
OM '01 Proceedings of the 2001 ACM SIGPLAN workshop on Optimization of middleware and distributed systems
Software Pipelining Irregular Loops On the TMS320C6000 VLIW DSP Architecture
OM '01 Proceedings of the 2001 ACM SIGPLAN workshop on Optimization of middleware and distributed systems
FDRA: a software-pipelining algorithm for embedded VLIW processors
ISSS '00 Proceedings of the 13th international symposium on System synthesis
Instruction scheduling for clustered VLIW architectures
ISSS '00 Proceedings of the 13th international symposium on System synthesis
Code generation for embedded processors
ISSS '00 Proceedings of the 13th international symposium on System synthesis
ShiftQ: a buffered interconnect for custom loop accelerators
CASES '01 Proceedings of the 2001 international conference on Compilers, architecture, and synthesis for embedded systems
Compiler-Assisted Multiple Instruction Word Retry for VLIW Architectures
IEEE Transactions on Parallel and Distributed Systems
Evaluating the Use of Register Queues in Software Pipelined Loops
IEEE Transactions on Computers - Special issue on the parallel architecture and compilation techniques conference
Automatic formal verification for scheduled VLIW code
Proceedings of the joint conference on Languages, compilers and tools for embedded systems: software and compilers for embedded systems
Loop fusion for clustered VLIW architectures
Proceedings of the joint conference on Languages, compilers and tools for embedded systems: software and compilers for embedded systems
Affinity-based cluster assignment for unrolled loops
ICS '02 Proceedings of the 16th international conference on Supercomputing
Optimal software pipelining of loops with control flows
ICS '02 Proceedings of the 16th international conference on Supercomputing
An interleaved cache clustered VLIW processor
ICS '02 Proceedings of the 16th international conference on Supercomputing
Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
Graph-partitioning based instruction scheduling for clustered processors
Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
Enhancing loop buffering of media and telecommunications applications using low-overhead predication
Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
Embedded software in real-time signal processing systems: design technologies
Readings in hardware/software co-design
Constraint analysis for DSP code generation
Readings in hardware/software co-design
Register tiling in nonrectangular iteration spaces
ACM Transactions on Programming Languages and Systems (TOPLAS)
Optimal code size reduction for software-pipelined and unfolded loops
Proceedings of the 15th international symposium on System Synthesis
PACT HDL: a C compiler targeting ASICs and FPGAs with power and performance optimizations
CASES '02 Proceedings of the 2002 international conference on Compilers, architecture, and synthesis for embedded systems
On achieving balanced power consumption in software pipelined loops
CASES '02 Proceedings of the 2002 international conference on Compilers, architecture, and synthesis for embedded systems
Minimizing Buffer Requirements under Rate-Optimal Schedule in Regular Dataflow Networks
Journal of VLSI Signal Processing Systems
Constraint satisfaction for relative location assignment and scheduling
Proceedings of the 2001 IEEE/ACM international conference on Computer-aided design
Hardware-Software partitioning and pipelined scheduling of transformative applications
IEEE Transactions on Very Large Scale Integration (VLSI) Systems
A Simulation Study of Decoupled Vector Architectures
The Journal of Supercomputing
Enhanced Co-Scheduling: A Software Pipelining Method Using Modulo-Scheduled Pipeline Theory
International Journal of Parallel Programming
Handling Global Constraints in Compiler Strategy
International Journal of Parallel Programming
A Vectorizing Compiler for Multimedia Extensions
International Journal of Parallel Programming
Meld Scheduling: A Technique for Relaxing Scheduling Constraints
International Journal of Parallel Programming
Combining Loop Transformations Considering Caches and Scheduling
International Journal of Parallel Programming
The Intel IA-64 Compiler Code Generator
IEEE Micro
Instruction Window Size Trade-Offs and Characterization of Program Parallelism
IEEE Transactions on Computers
Three Architectural Models for Compiler-Controlled Speculative Execution
IEEE Transactions on Computers
A Performance and Cost Analysis of Applying Superscalar Method to Mainframe Computers
IEEE Transactions on Computers
Unroll-Based Copy Elimination for Enhanced Pipeline Scheduling
IEEE Transactions on Computers
A Loop Transformation Theory and an Algorithm to Maximize Parallelism
IEEE Transactions on Parallel and Distributed Systems
Making Compaction-Based Parallelization Affordable
IEEE Transactions on Parallel and Distributed Systems
Generalized Multiway Branch Unit for VLIW Microprocessors
IEEE Transactions on Parallel and Distributed Systems
Heuristic Algorithms for Scheduling Iterative Task Computations on Distributed Memory Machines
IEEE Transactions on Parallel and Distributed Systems
Hypercube Algorithms on Mesh Connected Multicomputers
IEEE Transactions on Parallel and Distributed Systems
A finite state machine based format model of software pipelined loops with conditions
Progress in computer research
Probabilistic Rotation: Scheduling Graphs with Uncertain Execution Time
ICPP '97 Proceedings of the international Conference on Parallel Processing
Run-Time Support to Register Allocation for Loop Parallelization of Image Processing Programs
HPCN Europe 2000 Proceedings of the 8th International Conference on High-Performance Computing and Networking
Improving Code Efficiency for Reconfigurable VLIW Processors
IPDPS '02 Proceedings of the 16th International Parallel and Distributed Processing Symposium
Efficient Pipelining of Nested Loops: Unroll-and-Squash
IPDPS '02 Proceedings of the 16th International Parallel and Distributed Processing Symposium
CPR: Mixed Task and Data Parallel Scheduling for Distributed Systems
IPDPS '01 Proceedings of the 15th International Parallel & Distributed Processing Symposium
Unroll-Based Copy Elimination for Enhanced Pipeline Scheduling
LCPC '99 Proceedings of the 12th International Workshop on Languages and Compilers for Parallel Computing
Loop Shifting for Loop Compaction
LCPC '99 Proceedings of the 12th International Workshop on Languages and Compilers for Parallel Computing
Optimizing Loop Performance for Clustered VLIW Architectures
Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques
Exploiting Pseudo-Schedules to Guide Data Dependence Graph Partitioning
Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques
Influence of Variable Time Operations in Static Instruction Scheduling
Euro-Par '99 Proceedings of the 5th International Euro-Par Conference on Parallel Processing
Software pipelining: A Genetic Algorithm Approach
PACT '94 Proceedings of the IFIP WG10.3 Working Conference on Parallel Architectures and Compilation Techniques
PACT '93 Proceedings of the IFIP WG10.3 Working Conference on Architectures and Compilation Techniques for Fine and Medium Grain Parallelism
Software Pipelining: Petri Net Pacemaker
PACT '93 Proceedings of the IFIP WG10.3 Working Conference on Architectures and Compilation Techniques for Fine and Medium Grain Parallelism
Balancing Fine- and Medium-Grained Parallelism in Scheduling Loops for the XIMD Architecture
PACT '93 Proceedings of the IFIP WG10.3 Working Conference on Architectures and Compilation Techniques for Fine and Medium Grain Parallelism
Modeling Instruction-Level Parallelism for Software Pipelining
PACT '93 Proceedings of the IFIP WG10.3 Working Conference on Architectures and Compilation Techniques for Fine and Medium Grain Parallelism
Global Software Pipelining with Iteration Preselection
CC '00 Proceedings of the 9th International Conference on Compiler Construction
Software Pipelining of Nested Loops
CC '01 Proceedings of the 10th International Conference on Compiler Construction
A First Step Towards Time Optimal Software Pipelining of Loops with Control Flows
CC '01 Proceedings of the 10th International Conference on Compiler Construction
Reduced code size modulo scheduling in the absence of hardware support
Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture
Proceedings of the 40th annual Design Automation Conference
Predicate-aware scheduling: a technique for reducing resource constraints
Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization
Split-Path Enhanced Pipeline Scheduling
IEEE Transactions on Parallel and Distributed Systems
A compiler approach for reducing data cache energy
ICS '03 Proceedings of the 17th annual international conference on Supercomputing
ICS '03 Proceedings of the 17th annual international conference on Supercomputing
Hades-towards the design of an asynchronous superscalar processor
ASYNC '95 Proceedings of the 2nd Working Conference on Asynchronous Design Methodologies
Architecture Design of Reconfigurable Pipelined Datapaths
ARVLSI '99 Proceedings of the 20th Anniversary Conference on Advanced Research in VLSI
Non-Consistent Dual Register Files to Reduce Register Pressure
HPCA '95 Proceedings of the 1st IEEE Symposium on High-Performance Computer Architecture
Decoupled vector architectures
HPCA '96 Proceedings of the 2nd IEEE Symposium on High-Performance Computer Architecture
Co-Scheduling Hardware and Software Pipelines
HPCA '96 Proceedings of the 2nd IEEE Symposium on High-Performance Computer Architecture
The Architecture of Massively Parallel Processor CP-PACS
PAS '97 Proceedings of the 2nd AIZU International Symposium on Parallel Algorithms / Architecture Synthesis
The Effectiveness of Loop Unrolling for Modulo Scheduling in Clustered VLIW Architectures
ICPP '00 Proceedings of the Proceedings of the 2000 International Conference on Parallel Processing
Efficient Scheduling of DSP Code on Processors with Distributed Register Files
Proceedings of the 12th international symposium on System synthesis
Jacobi Orderings for Multi-Port Hypercubes
IPPS '98 Proceedings of the 12th. International Parallel Processing Symposium on International Parallel Processing Symposium
Register-Sensitive Software Pipelining
IPPS '98 Proceedings of the 12th. International Parallel Processing Symposium on International Parallel Processing Symposium
An Enhanced Co-Scheduling Method using Reduced MS-State Diagrams
IPPS '98 Proceedings of the 12th. International Parallel Processing Symposium on International Parallel Processing Symposium
Mapping of generalized template matching onto reconfigurable computers
IEEE Transactions on Very Large Scale Integration (VLSI) Systems - Special section on the 2001 international conference on computer design (ICCD)
CODES '94 Proceedings of the 3rd international workshop on Hardware/software co-design
Code size reduction technique and implementation for software-pipelined DSP applications
ACM Transactions on Embedded Computing Systems (TECS)
Automatic generation of application specific processors
Proceedings of the 2003 international conference on Compilers, architecture and synthesis for embedded systems
A timed Petri-net model for fine-grain loop scheduling
CASCON '91 Proceedings of the 1991 conference of the Centre for Advanced Studies on Collaborative research
Register allocation for optimal loop scheduling
CASCON '93 Proceedings of the 1993 conference of the Centre for Advanced Studies on Collaborative research: distributed computing - Volume 2
Loop Shifting and Compaction for the High-Level Synthesis of Designs with Complex Control Flow
Proceedings of the conference on Design, automation and test in Europe - Volume 1
Analysis and Modeling of Energy Reducing Source Code Transformations
Proceedings of the conference on Design, automation and test in Europe - Volume 3
Instruction Scheduling for Low Power
Journal of VLSI Signal Processing Systems
An experimental evaluation of scalar replacement on scientific benchmarks
Software—Practice & Experience
Application-domain-driven system design for pervasive video processing
Ambient intelligence
Register Constrained Modulo Scheduling
IEEE Transactions on Parallel and Distributed Systems
Code Generation for Single-Dimension Software Pipelining of Multi-Dimensional Loops
Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization
Single-Dimension Software Pipelining for Multi-Dimensional Loops
Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization
Probabilistic Predicate-Aware Modulo Scheduling
Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization
Real-Time Imaging - Special issue on software engineering
The design of dynamically reconfigurable datapath coprocessors
ACM Transactions on Embedded Computing Systems (TECS)
Field-testing IMPACT EPIC research results in Itanium 2
Proceedings of the 31st annual international symposium on Computer architecture
Time optimal software pipelining of loops with control flows
International Journal of Parallel Programming
Optimistic register coalescing
ACM Transactions on Programming Languages and Systems (TOPLAS)
DATE '03 Proceedings of the conference on Design, Automation and Test in Europe - Volume 1
Fast and Accurate Multiprocessor Architecture Exploration with Symbolic Programs
DATE '03 Proceedings of the conference on Design, Automation and Test in Europe - Volume 1
Multithreaded Synchronous Data Flow Simulation
DATE '03 Proceedings of the conference on Design, Automation and Test in Europe - Volume 1
Static Placement, Dynamic Issue (SPDI) Scheduling for EDGE Architectures
Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques
Instruction level parallelism of non-uniform acyclic loops
Journal of Computing Sciences in Colleges
Combining Extended Retiming and Unfolding for Rate-Optimal Graph Transformation
Journal of VLSI Signal Processing Systems
Register allocation for software pipelined multi-dimensional loops
Proceedings of the 2005 ACM SIGPLAN conference on Programming language design and implementation
Automatically partitioning packet processing applications for pipelined architectures
Proceedings of the 2005 ACM SIGPLAN conference on Programming language design and implementation
Complementing software pipelining with software thread integration
LCTES '05 Proceedings of the 2005 ACM SIGPLAN/SIGBED conference on Languages, compilers, and tools for embedded systems
Automatic multithreading and multiprocessing of C programs for IXP
Proceedings of the tenth ACM SIGPLAN symposium on Principles and practice of parallel programming
A reprogrammable customization framework for efficient branch resolution in embedded processors
ACM Transactions on Embedded Computing Systems (TECS)
Future wireless convergence platforms
CODES+ISSS '05 Proceedings of the 3rd IEEE/ACM/IFIP international conference on Hardware/software codesign and system synthesis
Cutpoints for formal equivalence verification of embedded software
Proceedings of the 5th ACM international conference on Embedded software
Reducing data cache leakage energy using a compiler-based approach
ACM Transactions on Embedded Computing Systems (TECS)
Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
Using a lookahead window in a compaction-based parallelizing compiler
ACM SIGMICRO Newsletter
Automatic Thread Extraction with Decoupled Software Pipelining
Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture
Exploiting Vector Parallelism in Software Pipelined Loops
Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture
Compiler-directed high-level energy estimation and optimization
ACM Transactions on Embedded Computing Systems (TECS)
Software and hardware techniques to optimize register file utilization in VLIW architectures
International Journal of Parallel Programming
Combining extended retiming and unfolding for rate-optimal graph transformation
Journal of VLSI Signal Processing Systems
A new register file access architecture for software pipelining in VLIW processors
Proceedings of the 2005 Asia and South Pacific Design Automation Conference
Compiler transformations for effectively exploiting a zero overhead loop buffer
Software—Practice & Experience
Automatic instruction scheduler retargeting by reverse-engineering
Proceedings of the 2006 ACM SIGPLAN conference on Programming language design and implementation
Generic software pipelining at the assembly level
SCOPES '05 Proceedings of the 2005 workshop on Software and compilers for embedded systems
Compiling for stream processing
Proceedings of the 15th international conference on Parallel architectures and compilation techniques
Reaching fast code faster: using modeling for efficient software thread integration on a VLIW DSP
CASES '06 Proceedings of the 2006 international conference on Compilers, architecture and synthesis for embedded systems
Embedded software verification using symbolic execution and uninterpreted functions
International Journal of Parallel Programming
Merging Head and Tail Duplication for Convergent Hyperblock Formation
Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Single-dimension software pipelining for multidimensional loops
ACM Transactions on Architecture and Code Optimization (TACO)
Journal of VLSI Signal Processing Systems
FEADS: a framework for exploring the application design space on network processors
International Journal of Parallel Programming
An Analytical Approach to Scheduling Code for Superscalar and VLIW Architectures
ICPP '94 Proceedings of the 1994 International Conference on Parallel Processing - Volume 01
Register pointer architecture for efficient embedded processors
Proceedings of the conference on Design, automation and test in Europe
Executing irregular scientific applications on stream architectures
Proceedings of the 21st annual international conference on Supercomputing
MPSoC memory optimization using program transformation
ACM Transactions on Design Automation of Electronic Systems (TODAES)
Software optimization of video codecs on pentium processor with MMX technology
EURASIP Journal on Applied Signal Processing
Pfelib: a performance primitives library for embedded vision
EURASIP Journal on Embedded Systems
Facilitating compiler optimizations through the dynamic mapping of alternate register structures
CASES '07 Proceedings of the 2007 international conference on Compilers, architecture, and synthesis for embedded systems
Latency-tolerant software pipelining in a production compiler
Proceedings of the 6th annual IEEE/ACM international symposium on Code generation and optimization
Algorithms and analysis of scheduling for loops with minimum switching
International Journal of Computational Science and Engineering
A new strategy for multiprocessor scheduling of cyclic task graphs
International Journal of High Performance Computing and Networking
Proceedings of the 2008 ACM symposium on Applied computing
Optimized mapping for enchancing the operation parallelism in coarse-grained reconfigurable arrays
SMO'06 Proceedings of the 6th WSEAS International Conference on Simulation, Modelling and Optimization
Rotating register allocation with multiple rotating branches
Proceedings of the 22nd annual international conference on Supercomputing
Post-pass periodic register allocation to minimise loop unrolling degree
Proceedings of the 2008 ACM SIGPLAN-SIGBED conference on Languages, compilers, and tools for embedded systems
Placement-and-routing-based register allocation for coarse-grained reconfigurable arrays
Proceedings of the 2008 ACM SIGPLAN-SIGBED conference on Languages, compilers, and tools for embedded systems
ACM Transactions on Embedded Computing Systems (TECS)
Register allocation for software pipelined multidimensional loops
ACM Transactions on Programming Languages and Systems (TOPLAS)
VEAL: Virtualized Execution Accelerator for Loops
ISCA '08 Proceedings of the 35th Annual International Symposium on Computer Architecture
Automatic architecture refinement techniques for customizing processing elements
Proceedings of the 45th annual Design Automation Conference
Automated dynamic throughput-constrained structural-level pipelining in streaming applications
Proceedings of the conference on Design, automation and test in Europe
Validating High-Level Synthesis
CAV '08 Proceedings of the 20th international conference on Computer Aided Verification
Stream Scheduling: A Framework to Manage Bulk Operations in Memory Hierarchies
Euro-Par '08 Proceedings of the 14th international Euro-Par conference on Parallel Processing
Timing optimization via nest-loop pipelining considering code size
Microprocessors & Microsystems
Integrated Modulo Scheduling for Clustered VLIW Architectures
HiPEAC '09 Proceedings of the 4th International Conference on High Performance Embedded Architectures and Compilers
Software Pipelining in Nested Loops with Prolog-Epilog Merging
HiPEAC '09 Proceedings of the 4th International Conference on High Performance Embedded Architectures and Compilers
Synthesis of reconfigurable high-performance multicore systems
Proceedings of the ACM/SIGDA international symposium on Field programmable gate arrays
Resource aware mapping on coarse grained reconfigurable arrays
Microprocessors & Microsystems
Design and implementation of a queue compiler
Microprocessors & Microsystems
Periodic register saturation in innermost loops
Parallel Computing
Proceedings of the 6th ACM conference on Computing frontiers
Compiler assisted architectural exploration framework for coarse grained reconfigurable arrays
The Journal of Supercomputing
Modulo scheduling without overlapped lifetimes
Proceedings of the 2009 ACM SIGPLAN/SIGBED conference on Languages, compilers, and tools for embedded systems
Mapping of nomadic multimedia applications on the ADRES reconfigurable array processor
Microprocessors & Microsystems
Design and Tool Flow of Multimedia MPSoC Platforms
Journal of Signal Processing Systems
Energy-Aware Loop Scheduling and Assignment for Multi-Core, Multi-Functional-Unit Architecture
Journal of Signal Processing Systems
A simple, verified validator for software pipelining
Proceedings of the 37th annual ACM SIGPLAN-SIGACT symposium on Principles of programming languages
Preprocessing strategy for effective modulo scheduling on multi-issue digital signal processors
CC'07 Proceedings of the 16th international conference on Compiler construction
CC'07 Proceedings of the 16th international conference on Compiler construction
Integrating high-level optimizations in a production compiler: design and implementation experience
CC'03 Proceedings of the 12th international conference on Compiler construction
MIRS: modulo scheduling with integrated register spilling
LCPC'01 Proceedings of the 14th international conference on Languages and compilers for parallel computing
Minimizing communication in rate-optimal software pipelining for stream programs
Proceedings of the 8th annual IEEE/ACM international symposium on Code generation and optimization
Towards a source level compiler: source level modulo scheduling
Program analysis and compilation, theory and practice
Proceedings of the twenty-second annual ACM symposium on Parallelism in algorithms and architectures
Translation validation of high-level synthesis
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
Reducing Memory Constraints in Modulo Scheduling Synthesis for FPGAs
ACM Transactions on Reconfigurable Technology and Systems (TRETS)
An Efficient Memory Organization for High-ILP Inner Modem Baseband SDR Processors
Journal of Signal Processing Systems
Fine-grain dynamic instruction placement for L0 scratch-pad memory
CASES '10 Proceedings of the 2010 international conference on Compilers, architectures and synthesis for embedded systems
Automatic memory partitioning: increasing memory parallelism via data structure partitioning
CODES/ISSS '10 Proceedings of the eighth IEEE/ACM/IFIP international conference on Hardware/software codesign and system synthesis
SAS'10 Proceedings of the 17th international conference on Static analysis
Hierarchical multithreading: programming model and system software
IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
How many threads to spawn during program multithreading?
LCPC'10 Proceedings of the 23rd international conference on Languages and compilers for parallel computing
Using a "codelet" program execution model for exascale machines: position paper
Proceedings of the 1st International Workshop on Adaptive Self-Tuning Computing Systems for the Exaflop Era
Precedence constraint posting for cyclic scheduling problems
CPAIOR'11 Proceedings of the 8th international conference on Integration of AI and OR techniques in constraint programming for combinatorial optimization problems
The Journal of Supercomputing
Improving performance through deep value profiling and specialization with code transformation
Computer Languages, Systems and Structures
ACM Transactions on Embedded Computing Systems (TECS)
LCPC'05 Proceedings of the 18th international conference on Languages and Compilers for Parallel Computing
Combined ILP and register tiling: analytical model and optimization framework
LCPC'05 Proceedings of the 18th international conference on Languages and Compilers for Parallel Computing
Exploring the limits of GPGPU scheduling in control flow bound applications
ACM Transactions on Architecture and Code Optimization (TACO) - HIPEAC Papers
Instruction re-selection for iterative modulo scheduling on high performance multi-issue DSPs
EUC'06 Proceedings of the 2006 international conference on Emerging Directions in Embedded and Ubiquitous Computing
SCAN: a heuristic for near-optimal software pipelining
Euro-Par'06 Proceedings of the 12th international conference on Parallel Processing
Multi-dimensional kernel generation for loop nest software pipelining
Euro-Par'06 Proceedings of the 12th international conference on Parallel Processing
Task partitioning for multi-core network processors
CC'05 Proceedings of the 14th international conference on Compiler Construction
Trimaran: an infrastructure for research in instruction-level parallelism
LCPC'04 Proceedings of the 17th international conference on Languages and Compilers for High Performance Computing
Using the meeting graph framework to minimise kernel loop unrolling for scheduled loops
LCPC'09 Proceedings of the 22nd international conference on Languages and Compilers for Parallel Computing
Single thread program parallelism with dataflow abstracting thread
ICA3PP'10 Proceedings of the 10th international conference on Algorithms and Architectures for Parallel Processing - Volume Part II
Integrated Code Generation for Loops
ACM Transactions on Embedded Computing Systems (TECS)
Scheduling expression DAGs for minimal register need
Computer Languages
Mathematical and Computer Modelling: An International Journal
Automatic generation of software pipelines for heterogeneous parallel systems
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Cache-sensitive MapReduce DGEMM algorithms for shared memory architectures
Proceedings of the South African Institute for Computer Scientists and Information Technologists Conference
Optimal and heuristic global code motion for minimal spilling
CC'13 Proceedings of the 22nd international conference on Compiler Construction
Near-Optimal Microprocessor and Accelerators Codesign with Latency and Throughput Constraints
ACM Transactions on Architecture and Code Optimization (TACO)
On-the-fly pipeline parallelism
Proceedings of the twenty-fifth annual ACM symposium on Parallelism in algorithms and architectures
Software thread integration for instruction-level parallelism
ACM Transactions on Embedded Computing Systems (TECS)
A catalog of stream processing optimizations
ACM Computing Surveys (CSUR)
The benefits of using variable-length pipelined operations in high-level synthesis
ACM Transactions on Embedded Computing Systems (TECS)
Allocating rotating registers by scheduling
Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture
Just-In-Time Software Pipelining
Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization
SDC-based modulo scheduling for pipeline synthesis
Proceedings of the International Conference on Computer-Aided Design
Predicate-aware, makespan-preserving software pipelining of scheduling tables
ACM Transactions on Architecture and Code Optimization (TACO)
This paper shows that software pipelining is an effective and viable scheduling technique for VLIW processors. In software pipelining, iterations of a loop in the source program are continuously initiated at constant intervals, before the preceding iterations complete. The advantage of software pipelining is that optimal performance can be achieved with compact object code.

This paper extends previous results on software pipelining in two ways. First, it shows that with an improved algorithm, near-optimal performance can be obtained without specialized hardware. Second, we propose a hierarchical reduction scheme whereby entire control constructs are reduced to an object similar to an operation in a basic block. With this scheme, all innermost loops, including those containing conditional statements, can be software pipelined. Hierarchical reduction also diminishes the start-up cost of loops with a small number of iterations. It complements the software pipelining technique, permitting a consistent performance improvement to be obtained.

The techniques proposed have been validated by an implementation of a compiler for Warp, a systolic array consisting of 10 VLIW processors. This compiler has been used to develop a large number of applications in the areas of image, signal and scientific processing.
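To make "initiated at constant intervals" concrete, here is a minimal sketch of modulo scheduling in Python. It is a simplified greedy scheduler written for illustration, not the paper's algorithm. It assumes that every operation reserves its resources only in its issue cycle and that operations are listed in a topological order of the intra-iteration (distance 0) dependences. The op names (`ld`, `mul`, `add`, `st`) and resource names are hypothetical. The scheduler starts from the resource-constrained lower bound on the initiation interval (II) and tries successively larger II values. At each II it places operations into a modulo reservation table and then checks every dependence, including loop-carried ones, against the constraint `start[dst] >= start[src] + latency(src) - II * distance`.

```python
import math


def modulo_schedule(ops, edges, capacity, max_ii=64):
    """Greedy modulo scheduler (illustrative sketch, not Lam's exact algorithm).

    ops:      dict name -> (latency, {resource: units}), in topological order
              of the distance-0 dependences (dict insertion order is used).
    edges:    list of (src, dst, distance); distance > 0 means loop-carried.
    capacity: dict resource -> units available per cycle.
    Returns (ii, start_times) or None if no schedule is found up to max_ii.
    """
    # Resource-constrained lower bound: total demand / capacity, per resource.
    usage = {}
    for _, res in ops.values():
        for r, n in res.items():
            usage[r] = usage.get(r, 0) + n
    res_mii = max(math.ceil(u / capacity[r]) for r, u in usage.items())

    for ii in range(max(res_mii, 1), max_ii + 1):
        # Modulo reservation table: row (t mod ii) holds resources used at t.
        mrt = [dict.fromkeys(capacity, 0) for _ in range(ii)]
        start = {}
        placed_all = True
        for name, (_, res) in ops.items():
            # Earliest start implied by predecessors that are already placed.
            earliest = 0
            for src, dst, dist in edges:
                if dst == name and src in start:
                    earliest = max(earliest, start[src] + ops[src][0] - ii * dist)
            # Trying ii consecutive cycles covers every row of the table once.
            for t in range(earliest, earliest + ii):
                row = mrt[t % ii]
                if all(row[r] + n <= capacity[r] for r, n in res.items()):
                    for r, n in res.items():
                        row[r] += n
                    start[name] = t
                    break
            else:
                placed_all = False
                break
        # Re-check all dependences, including loop-carried back edges.
        if placed_all and all(
            start[d] >= start[s] + ops[s][0] - ii * k for s, d, k in edges
        ):
            return ii, start
    return None


# Hypothetical loop body: load, multiply, accumulate (recurrence), store.
OPS = {
    "ld": (2, {"mem": 1}),
    "mul": (3, {"fpu": 1}),
    "add": (1, {"alu": 1}),
    "st": (1, {"mem": 1}),
}
EDGES = [("ld", "mul", 0), ("mul", "add", 0), ("add", "st", 0), ("add", "add", 1)]
CAPACITY = {"mem": 1, "fpu": 1, "alu": 1}

print(modulo_schedule(OPS, EDGES, CAPACITY))
```

With one memory unit and two memory operations, the lower bound is II = 2. The call above yields `(2, {'ld': 0, 'mul': 2, 'add': 5, 'st': 7})`: each iteration takes 8 cycles from start to finish, but a new one begins every 2 cycles, so about four iterations are in flight at once. If the accumulate latency grows to 3, the recurrence `add -> add` (distance 1) forces II = 3 even though resources would allow 2.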
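The hierarchical reduction idea, in which a control construct is turned into "an object similar to an operation", can be sketched as follows. This assumes each branch of an if-then-else has already been scheduled into a per-cycle reservation table (a list of `{resource: units}` dicts). The construct is then represented by one pseudo-operation whose length is that of the longer branch and whose per-cycle resource usage is the larger of the two branches' usage. The table layout and the helper names `reduce_conditional` and `fits` are assumptions made for this illustration, not the paper's code. Because the pseudo-operation occupies several cycles, `fits` checks it against a modulo reservation table row by row and allows for wrap-around.

```python
def reduce_conditional(then_table, else_table):
    """Reduce two scheduled branches to one multi-cycle pseudo-operation.

    Each table is a list of per-cycle dicts {resource: units}. The result
    reserves, in every cycle, the maximum that either branch needs, so it is
    safe whichever branch executes at run time.
    """
    length = max(len(then_table), len(else_table))
    then_rows = then_table + [{}] * (length - len(then_table))
    else_rows = else_table + [{}] * (length - len(else_table))
    reduced = []
    for t, e in zip(then_rows, else_rows):
        reduced.append({r: max(t.get(r, 0), e.get(r, 0)) for r in t.keys() | e.keys()})
    return reduced


def fits(mrt, table, t, capacity):
    """Check whether a multi-cycle reservation table can start at cycle t.

    mrt is a modulo reservation table with one row per cycle of the
    initiation interval. Demands that wrap onto the same row are summed.
    """
    ii = len(mrt)
    need = {}
    for c, row in enumerate(table):
        slot = (t + c) % ii
        for r, n in row.items():
            need[(slot, r)] = need.get((slot, r), 0) + n
    return all(mrt[s].get(r, 0) + n <= capacity[r] for (s, r), n in need.items())


# Hypothetical branches: "then" uses the ALU and then memory; "else" uses ALU + FPU.
THEN = [{"alu": 1}, {"mem": 1}]
ELSE = [{"alu": 1, "fpu": 1}]
pseudo_op = reduce_conditional(THEN, ELSE)
print(pseudo_op)  # two cycles: ALU + FPU, then memory

# An II = 2 table where cycle 0 already uses the ALU and cycle 1 uses memory.
MRT = [{"alu": 1, "fpu": 0, "mem": 0}, {"alu": 0, "fpu": 0, "mem": 1}]
CAP = {"alu": 1, "fpu": 1, "mem": 1}
print(fits(MRT, pseudo_op, 0, CAP), fits(MRT, pseudo_op, 1, CAP))
```

Taking the per-cycle maximum is conservative: the reserved resources always cover whichever branch is taken. In exchange, the loop body becomes a single basic block of ordinary and pseudo operations, so a modulo scheduler can treat it like any other straight-line body.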