Wattch: a framework for architectural-level power analysis and optimizations
Proceedings of the 27th annual international symposium on Computer architecture
The Energy Impact of Aggressive Loop Fusion
Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques
Optimization principles and application performance evaluation of a multithreaded GPU using CUDA
Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming
A compiler framework for optimization of affine loop nests for gpgpus
Proceedings of the 22nd annual international conference on Supercomputing
A performance study of general-purpose applications on graphics processors using CUDA
Journal of Parallel and Distributed Computing
Program optimization carving for GPU computing
Journal of Parallel and Distributed Computing
Benchmarking GPUs to tune dense linear algebra
Proceedings of the 2008 ACM/IEEE conference on Supercomputing
IEEE Transactions on Parallel and Distributed Systems
Prediction-Based Power-Performance Adaptation of Multithreaded Scientific Codes
IEEE Transactions on Parallel and Distributed Systems
Energy-Oriented OpenMP Parallel Loop Scheduling
ISPA '08 Proceedings of the 2008 IEEE International Symposium on Parallel and Distributed Processing with Applications
Temperature-constrained power control for chip multiprocessors with online model estimation
Proceedings of the 36th annual international symposium on Computer architecture
Power Consumption of GPUs from a Software Perspective
ICCS '09 Proceedings of the 9th International Conference on Computational Science: Part I
On the energy efficiency of graphics processing units for scientific computing
IPDPS '09 Proceedings of the 2009 IEEE International Symposium on Parallel&Distributed Processing
Program Optimization of Array-Intensive SPEC2k Benchmarks on Multithreaded GPU Using CUDA and Brook+
ICPADS '09 Proceedings of the 2009 15th International Conference on Parallel and Distributed Systems
An integrated GPU power and performance model
Proceedings of the 37th annual international symposium on Computer architecture
ORION 2.0: a fast and accurate NoC power and area model for early-stage design space exploration
Proceedings of the Conference on Design, Automation and Test in Europe
Fine-grained resource sharing for concurrent GPGPU kernels
HotPar'12 Proceedings of the 4th USENIX conference on Hot Topics in Parallelism
Dataflow-driven GPU performance projection for multi-kernel transformations
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Improving GPGPU concurrency with elastic kernels
Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systems
Hi-index | 0.00 |
As one of the most popular accelerators, Graphics Processing Unit (GPU) has demonstrated high computing power in several application fields. On the other hand, GPU also produces high power consumption and has been one of the most largest power consumers in desktop and supercomputer systems. However, software power optimization method targeted for GPU has not been well studied. In this work, we propose kernel fusion method to reduce energy consumption and improve power efficiency on GPU architecture. Through fusing two or more independent kernels, kernel fusion method achieves higher utilization and much more balanced demand for hardware resources, which provides much more potential for power optimization, such as dynamic voltage and frequency scaling (DVFS). Basing on the CUDA programming model, this paper also gives several different fusion methods targeted for different situations. In order to make judicious fusion strategy, we deduce the process of fusing multiple independent kernels as a dynamic programming problem, which could be well solved with many existing tools and be simply embedded into compiler or runtime system. To reduce the overhead introduced by kernel fusion, we also propose effective method to reduce the usage of shared memory and coordinate the thread space of the kernels to be fused. Detailed experimental evaluation validates that the proposed kernel fusion method could reduce energy consumption without performance loss for several typical kernels.