Preallocation instruction scheduling with register pressure minimization using a combinatorial optimization approach

Authors:
Ghassan Shobaki;Maxim Shawabkeh;Najm Eldeen Abu Rmaileh
Affiliations:
Princess Sumaya University for Technology, Jordan;Google;Princess Sumaya University for Technology, Jordan
Venue:
ACM Transactions on Architecture and Code Optimization (TACO)
Year:
2013

Abstract

Balancing Instruction-Level Parallelism (ILP) and register pressure during preallocation instruction scheduling is a fundamentally important problem in code generation and optimization. The problem is known to be NP-complete. Many heuristic techniques have been proposed to solve this problem. However, due to the inherently conflicting requirements of maximizing ILP and minimizing register pressure, heuristic techniques may produce poor schedules in many cases. If such cases occur in hot code, significant performance degradation may result. A few combinatorial optimization approaches have also been proposed, but none of them has been shown to solve large real-world instances within reasonable time. This article presents the first combinatorial algorithm that is efficient enough to optimally solve large instances of this problem (basic blocks with hundreds of instructions) within a few seconds per instance. The proposed algorithm uses branch-and-bound enumeration with a number of powerful pruning techniques to efficiently search the solution space. The search is based on a cost function that incorporates schedule length and register pressure. An implementation of the proposed scheduling algorithm has been integrated into the LLVM Compiler and evaluated using SPEC CPU 2006. On x86-64, with a time limit of 10 ms per instruction, it optimally schedules 79% of the hot basic blocks in FP2006. Another 19% of the blocks are not optimally scheduled but are improved in cost relative to LLVM's heuristic. This improves the execution time of some benchmarks by up to 21%, with a geometric-mean improvement of 2.4% across the entire benchmark suite. With the use of precise latency information, the geometric-mean improvement is increased to 2.8%.
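
To make the idea of branch-and-bound enumeration over a combined cost function concrete, the following is a minimal Python sketch. It is not the article's algorithm: the abstract does not specify the cost function or the pruning techniques, so everything here is an assumption chosen for illustration. The sketch assumes a single-issue machine, a cost of schedule length plus a hypothetical `pressure_weight` times peak register pressure, one register per value that has at least one use, and a single simple lower bound for pruning. The names `Instr`, `schedule`, and `BLOCK` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    name: str
    latency: int          # cycles before a dependent instruction may issue
    operands: tuple = ()  # names of the instructions whose results are read


def schedule(instrs, pressure_weight):
    """Exhaustively search instruction orders of one basic block.

    Assumed cost: length + pressure_weight * peak_pressure, where length is
    the issue cycle of the last instruction plus one.
    """
    n = len(instrs)
    index = {ins.name: i for i, ins in enumerate(instrs)}
    preds = [[index[o] for o in ins.operands] for ins in instrs]
    num_uses = [0] * n
    for operand_list in preds:
        for p in operand_list:
            num_uses[p] += 1

    best = {"cost": float("inf"), "order": None, "length": None, "peak": None}
    issue = [None] * n           # issue cycle of each scheduled instruction
    uses_left = list(num_uses)   # unscheduled uses of each value
    order = []

    def search(next_cycle, live, peak):
        remaining = n - len(order)
        if remaining == 0:
            cost = next_cycle + pressure_weight * peak
            if cost < best["cost"]:
                best.update(cost=cost, length=next_cycle, peak=peak,
                            order=[instrs[i].name for i in order])
            return
        # Lower bound: every remaining instruction needs at least one cycle,
        # and the peak pressure can only grow. Prune if this cannot win.
        if next_cycle + remaining + pressure_weight * peak >= best["cost"]:
            return
        for i in range(n):
            if issue[i] is not None:
                continue
            if any(issue[p] is None for p in preds[i]):
                continue  # not ready: an operand is still unscheduled
            # Issue as early as the operand latencies allow (stalls implied).
            cycle = max([next_cycle] +
                        [issue[p] + instrs[p].latency for p in preds[i]])
            dying = 0
            for p in preds[i]:
                uses_left[p] -= 1
                if uses_left[p] == 0:
                    dying += 1  # last use: the operand's register is freed
            new_live = live - dying + (1 if num_uses[i] > 0 else 0)
            issue[i] = cycle
            order.append(i)
            search(cycle + 1, new_live, max(peak, new_live))
            order.pop()
            issue[i] = None
            for p in preds[i]:
                uses_left[p] += 1

    search(0, 0, 0)
    return best


# Hypothetical block: (a + b) * (c + d) with 3-cycle loads.
BLOCK = [
    Instr("a", 3), Instr("b", 3), Instr("c", 3), Instr("d", 3),
    Instr("add1", 1, ("a", "b")),
    Instr("add2", 1, ("c", "d")),
    Instr("mul", 1, ("add1", "add2")),
]

if __name__ == "__main__":
    for w in (1, 3):
        print(w, schedule(BLOCK, w))
```

On this block, a weight of 1 favors ILP: the best schedule issues all four loads first, giving length 8 with a peak of 4 live values. A weight of 3 favors low pressure: the best schedule delays the fourth load until after the first add, giving length 10 with a peak of 3. The same trade-off is what makes the combined problem hard for heuristics; a production solver would need far stronger bounds than the single one used here.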