Growing interest in graphics processing units has brought renewed attention to the Single Instruction Multiple Data (SIMD) execution model. SIMD machines give application developers tremendous computational power; however, programming them remains challenging. In particular, developers must deal with memory and control-flow divergences. These phenomena stem from a condition that we call data divergence, which occurs whenever two processing elements (PEs) see the same variable name holding different values. This article introduces divergence analysis, a static analysis that discovers data divergences. This analysis, currently deployed in an industrial-quality compiler, is useful in several ways: it improves the translation of SIMD code to non-SIMD CPUs, it helps developers manually improve their SIMD applications, and it guides the automatic optimization of SIMD programs. We demonstrate this last point by introducing the notion of a divergence-aware register spiller. This spiller uses information from our analysis to either rematerialize or share common data between PEs. As evidence of its effectiveness, we have tested it on a suite of 395 CUDA kernels from well-known benchmarks. The divergence-aware spiller produces GPU code that is 26.21% faster than the code produced by the register allocator used in the baseline compiler.
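
To make the notion of data divergence concrete, the following is a minimal sketch of how divergence might propagate through data dependences. It is not the analysis from the article: the toy instruction format, the function name `divergent_variables`, and the use of a `tid` variable as the only seed of divergence are all assumptions made for illustration. The sketch also ignores divergence introduced through control flow, which a full analysis would need to handle.

```python
# Hypothetical toy IR: each instruction is (destination, [source operands]).
# "tid" stands for a per-PE thread identifier, assumed to be the only
# value that differs between processing elements at program entry.

def divergent_variables(instructions, seeds=("tid",)):
    """Return the variables that may hold different values on different PEs."""
    divergent = set(seeds)
    changed = True
    while changed:  # iterate until no new divergent variable is found
        changed = False
        for dest, sources in instructions:
            # A variable becomes divergent if any of its operands is divergent.
            if dest not in divergent and any(s in divergent for s in sources):
                divergent.add(dest)
                changed = True
    return divergent


program = [
    ("base", ["n"]),               # depends only on a value shared by all PEs
    ("offset", ["tid"]),           # depends on the thread identifier
    ("addr", ["base", "offset"]),  # inherits divergence from offset
    ("limit", ["base"]),           # remains the same on every PE
]

result = divergent_variables(program)
print(sorted(result))  # ['addr', 'offset', 'tid']
```

In this sketch, `base` and `limit` are never marked divergent, so every PE would see the same value for them. Variables of that kind are the candidates a divergence-aware spiller could share between PEs or rematerialize rather than store once per PE.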