The Energy Impact of Aggressive Loop Fusion

Authors:
YongKang Zhu;Grigorios Magklis;Michael L. Scott;Chen Ding;David H. Albonesi
Affiliations:
University of Rochester, New York;Intel Barcelona Research Center, Barcelona, Spain;University of Rochester, New York;University of Rochester, New York;University of Rochester, New York
Venue:
Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques
Year:
2004

Citing 25
Cited 5

Automatic decomposition of scientific programs for parallel execution

POPL '87 Proceedings of the 14th ACM SIGACT-SIGPLAN symposium on Principles of programming languages
Estimating interlock and improving balance for pipelined architectures

Journal of Parallel and Distributed Computing
Vector Register Allocation

IEEE Transactions on Computers
Compiler optimizations for improving data locality

ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
Improving the ratio of memory operations to floating-point operations in loops

ACM Transactions on Programming Languages and Systems (TOPLAS)
Processor design for portable systems

Journal of VLSI Signal Processing Systems - Special issue on technologies for wireless computing
Voltage scheduling problem for dynamically variable voltage processors

ISLPED '98 Proceedings of the 1998 international symposium on Low power electronics and design
Influence of compiler optimizations on system power

Proceedings of the 37th Annual Design Automation Conference
Wattch: a framework for architectural-level power analysis and optimizations

Proceedings of the 27th annual international symposium on Computer architecture
Optimizing compilers for modern architectures: a dependence-based approach

Optimizing compilers for modern architectures: a dependence-based approach
A compiler approach to fast hardware design space exploration in FPGA-based systems

PLDI '02 Proceedings of the ACM SIGPLAN 2002 Conference on Programming language design and implementation
Computation regrouping: restructuring programs for temporal data cache locality

ICS '02 Proceedings of the 16th international conference on Supercomputing
Power and performance evaluation of globally asynchronous locally synchronous processors

ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
On achieving balanced power consumption in software pipelined loops

CASES '02 Proceedings of the 2002 international conference on Compilers, architecture, and synthesis for embedded systems
Collective Loop Fusion for Array Contraction

Proceedings of the 5th International Workshop on Languages and Compilers for Parallel Computing
Dynamic frequency and voltage control for a multiple clock domain microarchitecture

Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture
The design, implementation, and evaluation of a compiler algorithm for CPU energy reduction

PLDI '03 Proceedings of the ACM SIGPLAN 2003 conference on Programming language design and implementation
Compile-time composition of run-time data and iteration reorderings

PLDI '03 Proceedings of the ACM SIGPLAN 2003 conference on Programming language design and implementation
A scheduling model for reduced CPU energy

FOCS '95 Proceedings of the 36th Annual Symposium on Foundations of Computer Science
Profile-based dynamic voltage and frequency scaling for a multiple clock domain microprocessor

Proceedings of the 30th annual international symposium on Computer architecture
Energy-Efficient Processor Design Using Multiple Clock Domains with Dynamic Voltage and Frequency Scaling

HPCA '02 Proceedings of the 8th International Symposium on High-Performance Computer Architecture
Optimizing supercompilers for supercomputers

Optimizing supercompilers for supercomputers
Inter-Procedural Loop Fusion, Array Contraction and Rotation

Proceedings of the 12th International Conference on Parallel Architectures and Compilation Techniques
Improving effective bandwidth through compiler enhancement of global cache reuse

Journal of Parallel and Distributed Computing
Instruction balance and its relation to program energy consumption

LCPC'01 Proceedings of the 14th international conference on Languages and compilers for parallel computing

MPSoC memory optimization using program transformation

ACM Transactions on Design Automation of Electronic Systems (TODAES)
Energy minimization with loop fusion and multi-functional-unit scheduling for multidimensional DSP

Journal of Parallel and Distributed Computing
Online energy-saving algorithm for sensor networks in dynamic changing environments

Journal of Embedded Computing
Kernel Fusion: An Effective Method for Better Power Efficiency on Multithreaded GPU

GREENCOM-CPSCOM '10 Proceedings of the 2010 IEEE/ACM Int'l Conference on Green Computing and Communications & Int'l Conference on Cyber, Physical and Social Computing
Is software "green"? Application development environments and energy efficiency in open source applications

Information and Software Technology

Quantified Score

Hi-index	0.00

Visualization

Abstract

Loop fusion combines corresponding iterations of different loops. It is traditionally used to decrease program run time, by reducing loop overhead and increasing data locality. In this paper, however, we consider its effect on energy. the uniformity, or balance of demand for system resources. On a conventional superscalar processor, increased balance tends to increase IPC, and thus dynamic power, so that fusion-induced improvements in program energy are slightly smaller than improvements in program run time. If IPC is held constant, however, by reducing frequency and voltage-particularly on a processor with multiple clock domains-then energy improvements may significantly exceed run time improvements. We demonstrate the benefits of increased program balance under a theoretical model of processor energy consumption. We then evaluate the benefits of fusion empirically on synthetic and real-world benchmarks, using our existing loop-fusing compiler and a heavily modified version of the SimpleScalar/Wattch simulator. For the real-world benchmarks, we demonstrate energy savings ranging from 7-40%, with run times ranging from 1% slowdown to 17% speedup. In addition to validating our theoretical model, the simulation results allow us to "tease apart" the factors that contribute to fusion-induced time and energy savings.