Compiler-assisted energy optimization for clustered VLIW processors

Authors:
Rahul Nagpal; Y. N. Srikant
Venue:
Journal of Parallel and Distributed Computing
Year:
2012

Abstract

Clustered architectures are preferred for embedded systems because centralized register file architectures scale poorly in clock rate, chip area, and power consumption. Although clustering improves clock speed, reduces the energy consumption of the logic, and simplifies the design, it introduces overhead in the form of inter-cluster communication. This communication takes place over long global wires with high load capacitance, which delays execution and consumes significant energy. Inter-cluster communication also introduces many short idle cycles, significantly increasing the overall leakage energy consumed in the functional units. The trend towards device miniaturization (and the associated reduction in threshold voltage) makes energy consumption in interconnects and functional units even worse and limits the usability of clustered architectures in smaller technologies. However, technological advances now permit the design of interconnects and functional units with varying performance and power modes. In this paper, we propose scheduling algorithms that aggregate the scheduling slack of instructions and the communication slack of data values to exploit the low-power modes of functional units and interconnects. We then present a synergistic combination of these algorithms that simultaneously saves energy in functional units and interconnects, improving the usability of clustered architectures by achieving better overall energy-performance trade-offs. Even with conservative estimates of the contribution of the functional units and interconnects to overall processor energy consumption, the proposed combined scheme obtains, on average, an 8% and a 10% improvement in overall energy-delay product with 3.5% and 2% performance degradation for a 2-clustered and a 4-clustered machine, respectively. We present a detailed experimental evaluation of the proposed schemes. Our test bed uses the Trimaran compiler infrastructure.
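
The reported figures can be related through the definition of the energy-delay product, EDP = energy × delay. The following is a minimal sketch, not taken from the paper, of the energy reduction those figures would imply. It assumes that "performance degradation" means a proportional increase in execution time and that the EDP improvement is a proportional decrease in EDP; the function name `implied_energy_ratio` is hypothetical.

```python
def implied_energy_ratio(edp_improvement: float, slowdown: float) -> float:
    """Return E_new / E_old implied by an EDP improvement and a slowdown.

    Assumptions (not stated in the abstract):
      - EDP_new = (1 - edp_improvement) * EDP_old
      - D_new   = (1 + slowdown) * D_old
    Since EDP = E * D, it follows that E_new / E_old = EDP ratio / delay ratio.
    """
    edp_ratio = 1.0 - edp_improvement
    delay_ratio = 1.0 + slowdown
    return edp_ratio / delay_ratio


# Figures quoted in the abstract: (clusters, EDP improvement, slowdown).
reported = [(2, 0.08, 0.035), (4, 0.10, 0.02)]

for clusters, edp_gain, slowdown in reported:
    ratio = implied_energy_ratio(edp_gain, slowdown)
    print(f"{clusters}-clustered: energy ratio {ratio:.3f}, "
          f"implied energy saving {100 * (1 - ratio):.1f}%")
```

Under these assumptions, the implied overall energy savings are roughly 11% for the 2-clustered machine and roughly 12% for the 4-clustered machine. These are derived values, not results reported by the authors.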