Heterogeneous multi-core and streaming architectures such as the GPU, Cell, ClearSpeed, and Imagine processors have better power/performance ratios and memory bandwidth than traditional architectures, and they are increasingly being used to accelerate compute-intensive applications. Their performance advantage is achieved by using multiple SIMD processor cores while limiting the complexity of each core, and by combining this with a simplified memory system. In particular, these processors generally avoid cache coherency protocols and may even omit general-purpose caches, opting instead for restricted caches or explicitly managed local memory. We show how control flow can be emulated on such tiled SIMD architectures and how memory access can be organized to avoid the need for a general-purpose cache and to tolerate long memory latencies. Our technique uses streaming execution and multipass partitioning. Our prototype targets GPUs. On GPUs the memory system is deeply pipelined and the read and write caches are not coherent, so reads and writes must not use the same memory locations simultaneously. This requires double-buffered streaming. We emulate general control flow in a way that is transparent to the programmer, and our approach includes specific optimizations for dealing with double-buffering.
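The following is a minimal CPU-side sketch of the general idea of double-buffered multipass execution. It is not the implementation described in the abstract. It assumes a data-dependent loop can be emulated by repeating a straight-line pass over the whole stream until no element is still active, with each pass reading only from one buffer and writing only to the other. The workload (counting Collatz steps) and the names `collatz_pass` and `run_multipass` are hypothetical and chosen only for illustration.

```python
def collatz_pass(src, dst):
    """One straight-line pass: reads only from src, writes only to dst.

    Each stream element is a (value, steps) pair. Returns the number of
    elements that still need another pass.
    """
    active = 0
    for i, (value, steps) in enumerate(src):
        if value != 1:
            # Predicated loop body: only elements still "inside the loop" advance.
            value = value // 2 if value % 2 == 0 else 3 * value + 1
            steps += 1
        # Finished elements are copied through unchanged so dst is complete.
        dst[i] = (value, steps)
        if value != 1:
            active += 1
    return active


def run_multipass(values):
    """Emulate a per-element while loop with repeated passes over two buffers."""
    front = [(v, 0) for v in values]   # buffer read by the current pass
    back = [None] * len(values)        # buffer written by the current pass
    passes = 0
    while True:
        active = collatz_pass(front, back)
        front, back = back, front      # swap roles: the output becomes the next input
        passes += 1
        if active == 0:
            break
    return [steps for _, steps in front], passes


if __name__ == "__main__":
    steps, passes = run_multipass([1, 2, 3, 6])
    print(steps, passes)
```

In this sketch the two buffers swap roles after every pass, so no pass ever reads a location it is also writing. The cost is that finished elements must still be copied forward on every pass, which is the kind of overhead that optimizations aware of double-buffering would aim to reduce.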