Applications often contain a sequence of parallel operations to be offloaded to graphics processors; each operation can become an individual GPU kernel. Developers typically explore a variety of transformations for each kernel. Furthermore, it is well known that efficient data management is critical to achieving high GPU performance and that "fusing" multiple kernels into one may greatly improve data locality. Doing so, however, requires transformations across multiple, potentially nested, parallel loops; at the same time, the original code semantics and data dependencies must be preserved. Since each kernel may have distinct data access patterns, their combined dataflow can be nontrivial. As a result, the complexity of multi-kernel transformations often leads to significant effort with no guarantee of performance benefits. This paper proposes a dataflow-driven analytical framework to project GPU performance for a sequence of parallel operations. Users need only provide CPU code skeletons for a sequence of parallel loops. The framework then automatically identifies opportunities for multi-kernel transformations and data management, and it can project the overall performance without implementing GPU code or using physical hardware.
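To make the fusion idea concrete, the following is a minimal illustrative sketch (not from the paper, and in plain Python rather than GPU code): two elementwise "kernels" are merged into one loop so the intermediate array never round-trips through memory, while the computed result stays identical. All function names here are hypothetical.

```python
def unfused(a):
    # Kernel 1: elementwise square. The intermediate list 'b' is
    # materialized in memory before Kernel 2 reads it back.
    b = [x * x for x in a]
    # Kernel 2: elementwise increment.
    return [x + 1 for x in b]

def fused(a):
    # Fused kernel: square and increment in a single pass, so each
    # intermediate value stays local instead of being written out.
    return [x * x + 1 for x in a]

data = [1.0, 2.0, 3.0]
# Fusion must preserve the original semantics and data dependencies.
assert unfused(data) == fused(data)
```

On a GPU the analogous transformation merges two kernel launches into one, trading an intermediate global-memory array for register reuse; the paper's framework estimates when such a transformation is profitable before any GPU code is written.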