Spill code placement for SIMD machines

Authors:
Diogo Nunes Sampaio;Elie Gedeon;Fernando Magno Quintão Pereira;Sylvain Collange
Affiliations:
Departamento de Ciência da Computação, UFMG, Brazil;Departamento de Ciência da Computação, UFMG, Brazil;Departamento de Ciência da Computação, UFMG, Brazil;Departamento de Ciência da Computação, UFMG, Brazil
Venue:
SBLP'12 Proceedings of the 16th Brazilian conference on Programming Languages
Year:
2012

Abstract

The Single Instruction, Multiple Data (SIMD) execution model has received renewed attention, driven by the rise of graphics processing units (GPUs) as a powerful alternative for parallel computing. Many compiler optimizations have recently been proposed for this hardware, but register allocation remains largely unexplored. In this context, this paper describes a register spiller for SIMD machines that capitalizes on the opportunity to share identical data between threads. This sharing provides two benefits: first, it uses less memory, because more spilled values are shared among threads; second, it improves the access times to spilled values. We have implemented our proposed allocator in the Ocelot open source compiler and have been able to speed up the code produced by this framework by 21%. Although we designed our algorithm on top of a linear scan register allocator, we claim that our ideas can be easily adapted to the needs of other register allocators.
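
To illustrate the memory-saving argument, here is a minimal Python sketch. It is not the paper's algorithm, which the abstract does not detail. It assumes that each spilled value is already tagged as uniform (identical in every thread) or not, and that a uniform value needs one shared slot while any other value needs one slot per thread. The names `SpilledValue`, `place_spills`, and `naive_bytes`, the 4-byte word size, and the 32-thread group are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpilledValue:
    name: str
    uniform: bool  # assumption: True when every thread holds the same value


def place_spills(values, num_threads, word_bytes=4):
    """Assign each spilled value a slot kind and return the total footprint.

    Uniform values get a single slot shared by all threads; the rest get
    one private slot per thread. This is an illustrative cost model only.
    """
    placement = {}
    total_bytes = 0
    for v in values:
        if v.uniform:
            placement[v.name] = "shared"
            total_bytes += word_bytes
        else:
            placement[v.name] = "per-thread"
            total_bytes += word_bytes * num_threads
    return placement, total_bytes


def naive_bytes(values, num_threads, word_bytes=4):
    """Footprint when every spilled value gets a private slot per thread."""
    return len(values) * num_threads * word_bytes


# Hypothetical spilled values for a group of 32 threads.
spills = [
    SpilledValue("base_ptr", uniform=True),
    SpilledValue("tid_offset", uniform=False),
    SpilledValue("loop_bound", uniform=True),
]
placement, shared_aware = place_spills(spills, num_threads=32)
baseline = naive_bytes(spills, num_threads=32)
print(placement)
print(shared_aware, baseline)
```

Under these assumptions, the two uniform values cost one word each instead of 32, which is the kind of saving the abstract attributes to sharing spilled values among threads.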