Control flow emulation on tiled SIMD architectures

Authors:
Ghulam Lashari;Ondřej Lhoták;Michael McCool
Affiliations:
D. R. Cheriton School of Computer Science, University of Waterloo;D. R. Cheriton School of Computer Science, University of Waterloo;D. R. Cheriton School of Computer Science, University of Waterloo
Venue:
CC'08/ETAPS'08: Proceedings of the Joint European Conferences on Theory and Practice of Software, 17th International Conference on Compiler Construction
Year:
2008


Abstract

Heterogeneous multi-core and streaming architectures such as the GPU, Cell, ClearSpeed, and Imagine processors have better power/performance ratios and memory bandwidth than traditional architectures. These types of processors are increasingly being used to accelerate compute-intensive applications. Their performance advantage is achieved by using multiple SIMD processor cores but limiting the complexity of each core, and by combining this with a simplified memory system. In particular, these processors generally avoid the use of cache coherency protocols and may even omit general-purpose caches, opting for restricted caches or explicitly managed local memory. We show how control flow can be emulated on such tiled SIMD architectures and how memory access can be organized to avoid the need for a general-purpose cache and to tolerate long memory latencies. Our technique uses streaming execution and multipass partitioning. Our prototype targets GPUs. On GPUs the memory system is deeply pipelined and the read and write caches are not coherent, so the same memory locations may not be read and written simultaneously. This requires the use of double-buffered streaming. We emulate general control flow in a way that is transparent to the programmer, and our approach includes specific optimizations that deal with double-buffering.
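
The following is a minimal host-side sketch of the general idea described in the abstract, not the authors' algorithm or prototype. It assumes a program split into basic blocks, a per-element "program counter" stored in the stream, and one pass per block that reads from one buffer and writes to the other before the two are swapped. The kernel (counting Collatz steps) and every name in the code (`HALT`, `BLOCKS`, `run_pass`, `emulate`) are hypothetical and chosen only for illustration; plain Python lists stand in for GPU stream buffers.

```python
HALT = -1  # hypothetical marker: this stream element has finished


def block_test(x, steps):
    # Basic block 0: the loop condition "while x != 1".
    return (1 if x != 1 else HALT), x, steps


def block_body(x, steps):
    # Basic block 1: the loop body, then jump back to the test block.
    x = x // 2 if x % 2 == 0 else 3 * x + 1
    return 0, x, steps + 1


BLOCKS = {0: block_test, 1: block_body}


def run_pass(block_id, src, dst):
    """One pass over the whole stream for a single basic block.

    Reads only from src and writes only to dst, so no location is
    read and written in the same pass (the double-buffering constraint).
    Elements waiting at a different block are copied through unchanged.
    """
    for i, (pc, x, steps) in enumerate(src):
        if pc == block_id:
            dst[i] = BLOCKS[block_id](x, steps)
        else:
            dst[i] = (pc, x, steps)


def emulate(inputs):
    """Run the two-block loop over every element until all have halted."""
    src = [(0, x, 0) for x in inputs]  # every element starts at block 0
    dst = [None] * len(src)
    while any(pc != HALT for pc, _, _ in src):
        for block_id in BLOCKS:
            run_pass(block_id, src, dst)
            src, dst = dst, src  # swap the read and write buffers
    return [steps for _, _, steps in src]


if __name__ == "__main__":
    print(emulate([1, 2, 3, 6]))
```

Each element follows its own path through the loop even though every pass applies the same block to the whole stream. The copy-through of inactive elements is the extra cost that double-buffering imposes in this naive version, which is the kind of overhead the optimizations mentioned in the abstract would aim to reduce.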