Iterative stencil loops (ISLs) are used in many applications, and tiling is a well-known technique for localizing their computation. When ISLs are tiled across a parallel architecture, there are usually halo regions that need to be updated and exchanged among different processing elements (PEs). In addition, synchronization is often used to signal the completion of halo exchanges. Both communication and synchronization may incur significant overhead on parallel architectures with shared memory. This is especially true in the case of graphics processing units (GPUs), which do not preserve the state of the per-core L1 storage across global synchronizations. To reduce these overheads, ghost zones can be created to replicate stencil operations, reducing communication and synchronization costs at the expense of redundantly computing some values on multiple PEs. However, the selection of the optimal ghost zone size depends on the characteristics of both the architecture and the application, and it has only been studied for message-passing systems in a grid environment. To automate this process on shared-memory systems, we establish a performance model using NVIDIA's Tesla architecture as a case study and propose a framework that uses the performance model to automatically select the ghost zone size that performs best and to generate the corresponding code. The modeling is validated by four diverse ISL applications, for which the predicted ghost zone configurations achieve a speedup no less than 98% of the optimal speedup.
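The ghost-zone trade-off described above can be illustrated with a minimal 1-D sketch in NumPy. This is not the paper's CUDA framework or performance model; the function names, the 3-point averaging stencil, and all parameter values here are illustrative. Each tile reads `ghost` extra cells per side and recomputes them redundantly, so it can advance `ghost` time steps before the next data exchange, which stands in for the halo exchange plus global synchronization a GPU kernel would otherwise need at every step:

```python
import numpy as np

def jacobi_step(a):
    """One 3-point averaging step; the two end cells are held fixed."""
    out = a.copy()
    out[1:-1] = (a[:-2] + a[1:-1] + a[2:]) / 3.0
    return out

def reference(a, steps):
    """Straightforward iteration: one (implicit) exchange per step."""
    for _ in range(steps):
        a = jacobi_step(a)
    return a

def tiled_with_ghost_zones(a, steps, tile, ghost):
    """Tiles exchange data only every `ghost` steps, trading redundant
    computation in the ghost zone for fewer synchronizations."""
    n = len(a)
    assert steps % ghost == 0
    for _ in range(steps // ghost):
        new = np.empty_like(a)
        for start in range(0, n, tile):
            stop = min(start + tile, n)
            lo, hi = max(0, start - ghost), min(n, stop + ghost)
            block = a[lo:hi].copy()        # tile plus ghost zone
            for _ in range(ghost):         # advance locally, no exchange
                block = jacobi_step(block)
            # only the tile's own cells are still valid after `ghost` steps
            new[start:stop] = block[start - lo : stop - lo]
        a = new                            # one exchange per `ghost` steps
    return a

grid = np.sin(np.linspace(0.0, 3.0, 64))
exact = reference(grid, 8)
tiled = tiled_with_ghost_zones(grid, 8, tile=16, ghost=4)
assert np.allclose(exact, tiled)
```

The key invariant is that after `g` local steps, only cells at least `g` cells away from a tile's (interior) edge still hold correct values, which is exactly why the ghost zone must be `g` cells wide; enlarging `g` reduces exchanges linearly but grows the redundant work, and the paper's performance model is what arbitrates that trade-off on a real GPU.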