Optimizing delayed branches

Authors:
Thomas R. Gross; John L. Hennessy
Affiliations:
Departments of Electrical Engineering and Computer Science, Stanford University; Departments of Electrical Engineering and Computer Science, Stanford University
Venue:
MICRO 15 Proceedings of the 15th annual workshop on Microprogramming
Year:
1982

Abstract

Delayed branches are commonly found in micro-architectures. A compiler or assembler can exploit them by moving code from one of several points into the positions following the branch instruction. We present several strategies for moving code to utilize the branch delay and discuss the requirements and benefits of each. An algorithm for processing branch delays has been implemented, and we give empirical results. The performance data show that a reasonable percentage of these delays can be avoided.
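
To make the idea of moving code into the positions after a branch concrete, here is a minimal sketch in Python. It is not the algorithm from the paper. It assumes a single delay slot that always executes, assumes the code source is the instructions preceding the branch in the same basic block (the abstract does not say which points are used), and relies on a simplified register-only dependence test. The `Instr` type, the `fill_delay_slot` function, and the MIPS-like instruction text are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    text: str                        # assembly text, used only for display
    reads: frozenset = frozenset()   # registers read (hypothetical model)
    writes: frozenset = frozenset()  # registers written
    is_branch: bool = False


NOP = Instr("nop")


def independent(a: Instr, b: Instr) -> bool:
    """True if a and b may be reordered: no RAW, WAR, or WAW on registers."""
    return not (a.writes & (b.reads | b.writes)) and not (a.reads & b.writes)


def fill_delay_slot(block: list) -> list:
    """Try to replace the nop after a block-ending branch.

    Assumes block[-2] is the branch and block[-1] is its delay-slot nop.
    Returns a new list; the result equals the input if no instruction
    can be moved safely.
    """
    if len(block) < 2 or not block[-2].is_branch or block[-1] != NOP:
        return list(block)
    branch_idx = len(block) - 2
    # Scan upward from the instruction just before the branch.
    for i in range(branch_idx - 1, -1, -1):
        cand = block[i]
        if cand.is_branch:
            break  # never move code across another branch in this sketch
        crossed = block[i + 1 : branch_idx + 1]  # includes the branch itself
        if all(independent(cand, other) for other in crossed):
            # Remove the candidate from its old position, drop the nop,
            # and place the candidate in the delay slot.
            return block[:i] + crossed + [cand]
    return list(block)


block = [
    Instr("add r1, r2, r3", frozenset({"r2", "r3"}), frozenset({"r1"})),
    Instr("sub r4, r5, r6", frozenset({"r5", "r6"}), frozenset({"r4"})),
    Instr("beq r4, r0, L1", frozenset({"r4", "r0"}), is_branch=True),
    NOP,
]
filled = fill_delay_slot(block)
for ins in filled:
    print(ins.text)
```

In this example the `sub` cannot move because the branch reads `r4`, so the sketch picks the `add`, which touches no register used by the instructions it crosses. A real reorganizer would also have to account for memory dependences and for the other code sources the abstract alludes to, which this sketch ignores.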