Partitioned register files for VLIWs: a preliminary analysis of tradeoffs
MICRO 25 Proceedings of the 25th annual international symposium on Microarchitecture
The multicluster architecture: reducing cycle time through partitioning
MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
MediaBench: a tool for evaluating and synthesizing multimedia and communicatons systems
MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
25 years of the international symposia on Computer architecture (selected papers)
Effective cluster assignment for modulo scheduling
MICRO 31 Proceedings of the 31st annual ACM/IEEE international symposium on Microarchitecture
MICRO 31 Proceedings of the 31st annual ACM/IEEE international symposium on Microarchitecture
Clustered speculative multithreaded processors
ICS '99 Proceedings of the 13th international conference on Supercomputing
Lx: a technology platform for customizable VLIW embedded processing
Proceedings of the 27th annual international symposium on Computer architecture
A static power model for architects
Proceedings of the 33rd annual ACM/IEEE international symposium on Microarchitecture
Cache decay: exploiting generational behavior to reduce cache leakage power
ISCA '01 Proceedings of the 28th annual international symposium on Computer architecture
Power-aware modulo scheduling for high-performance VLIW processors
ISLPED '01 Proceedings of the 2001 international symposium on Low power electronics and design
Low swing dual threshold voltage domino logic
Proceedings of the 12th ACM Great Lakes symposium on VLSI
Drowsy caches: simple techniques for reducing leakage power
ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
Exploiting VLIW schedule slacks for dynamic and leakage energy reduction
Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
Graph-partitioning based instruction scheduling for clustered processors
Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
Cluster assignment for high-performance embedded VLIW processors
ACM Transactions on Design Automation of Electronic Systems (TODAES)
Power-Driven Challenges in Nanometer Design
IEEE Design & Test
The TigerSHARC DSP Architecture
IEEE Micro
Power: A First Class Design Constraint for Future Architecture and Automation
HiPC '00 Proceedings of the 7th International Conference on High Performance Computing
Efficient Interconnects for Clustered Microarchitectures
Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques
Optimizing Static Power Dissipation by Functional Units in Superscalar Processors
CC '02 Proceedings of the 11th International Conference on Compiler Construction
Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture
Managing static leakage energy in microprocessor functional units
Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture
Adapting instruction level parallelism for optimizing leakage in VLIW architectures
Proceedings of the 2003 ACM SIGPLAN conference on Language, compiler, and tool for embedded systems
Region-based hierarchical operation partitioning for multicluster processors
PLDI '03 Proceedings of the ACM SIGPLAN 2003 conference on Programming language design and implementation
Inter-Cluster Communication Models for Clustered VLIW Processors
HPCA '03 Proceedings of the 9th International Symposium on High-Performance Computer Architecture
Efficient Backtracking Instruction Schedulers
PACT '00 Proceedings of the 2000 International Conference on Parallel Architectures and Compilation Techniques
Instruction Scheduling for Clustered VLIW DSPs
PACT '00 Proceedings of the 2000 International Conference on Parallel Architectures and Compilation Techniques
ICCD '02 Proceedings of the 2002 IEEE International Conference on Computer Design: VLSI in Computers and Processors (ICCD'02)
CARS: A New Code Generation Framework for Clustered ILP Processors
HPCA '01 Proceedings of the 7th International Symposium on High-Performance Computer Architecture
HPCA '01 Proceedings of the 7th International Symposium on High-Performance Computer Architecture
Integrated temporal and spatial scheduling for extended operand clustered VLIW processors
Proceedings of the 1st conference on Computing frontiers
A Graph Matching Based Integrated Scheduling Framework for Clustered VLIW Processors
ICPPW '04 Proceedings of the 2004 International Conference on Parallel Processing Workshops
Microarchitectural Wire Management for Performance and Power in Partitioned Architectures
HPCA '05 Proceedings of the 11th International Symposium on High-Performance Computer Architecture
Power reduction techniques for microprocessor systems
ACM Computing Surveys (CSUR)
MiBench: A free, commercially representative embedded benchmark suite
WWC '01 Proceedings of the Workload Characterization, 2001. WWC-4. 2001 IEEE International Workshop
Compiler-assisted leakage energy optimization for clustered VLIW architectures
EMSOFT '06 Proceedings of the 6th ACM & IEEE International conference on Embedded software
INTACTE: an interconnect area, delay, and energy estimation tool for microarchitectural explorations
CASES '07 Proceedings of the 2007 international conference on Compilers, architecture, and synthesis for embedded systems
Exploring energy-performance trade-offs for heterogeneous interconnect clustered VLIW processors
HiPC'06 Proceedings of the 13th international conference on High Performance Computing
Hi-index | 0.00 |
Clustered architecture processors are preferred for embedded systems because centralized register file architectures scale poorly in terms of clock rate, chip area, and power consumption. Although clustering helps by improving the clock speed, reducing the energy consumption of the logic, and making the design simpler, it introduces extra overheads by way of inter-cluster communication. This communication happens over long global wires having high load capacitance which leads to delay in execution and significantly high energy consumption. Inter-cluster communication also introduces many short idle cycles, thereby significantly increasing the overall leakage energy consumption in the functional units. The trend towards miniaturization of devices (and associated reduction in threshold voltage) makes energy consumption in interconnects and functional units even worse, and limits the usability of clustered architectures in smaller technologies. However, technological advancements now permit the design of interconnects and functional units with varying performance and power modes. In this paper, we propose scheduling algorithms that aggregate the scheduling slack of instructions and communication slack of data values to exploit the low-power modes of functional units and interconnects. Finally, we present a synergistic combination of these algorithms that simultaneously saves energy in functional units and interconnects to improves the usability of clustered architectures by achieving better overall energy-performance trade-offs. Even with conservative estimates of the contribution of the functional units and interconnects to the overall processor energy consumption, the proposed combined scheme obtains on average 8% and 10% improvement in overall energy-delay product with 3.5% and 2% performance degradation for a 2-clustered and a 4-clustered machine, respectively. We present a detailed experimental evaluation of the proposed schemes. Our test bed uses the Trimaran compiler infrastructure.