Current loop buffer organizations for very long instruction word (VLIW) processors are essentially centralized. As a consequence, they are energy inefficient and their scalability is limited. To alleviate this problem, we propose a clustered loop buffer organization, in which the loop buffers are partitioned and the functional units are logically grouped to form clusters, along with two buffer control schemes that regulate the activity in each cluster. Furthermore, we propose a design-time scheme that generates clusters by analyzing an application profile and grouping closely related functional units. Simulation results indicate that the energy consumed in the clustered loop buffers is, on average, 63 percent lower than in an uncompressed centralized loop buffer scheme, 35 percent lower than in a centralized compressed loop buffer scheme, and 22 percent lower than in a randomly clustered loop buffer scheme.
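To make the profile-driven clustering idea concrete, here is a minimal sketch. It assumes a profile given as a list of sets, one per loop-body cycle, naming the functional units (FUs) active in that cycle. It substitutes a simple greedy agglomerative grouping by co-activity for whatever grouping criterion the design-time scheme actually uses. The function names (`coactivity`, `cluster_fus`, `slot_reads`) are hypothetical. "Slot reads" counts the buffer slots read per loop iteration under the assumption that a cluster's buffer is accessed only when one of its FUs is active. This is a rough proxy for loop buffer energy, not the energy model behind the reported results, and the two buffer control schemes are not modeled.

```python
from itertools import combinations


def coactivity(profile, n_fus):
    """Count, for every FU pair, the loop-body cycles in which both are active."""
    counts = {pair: 0 for pair in combinations(range(n_fus), 2)}
    for active in profile:
        for pair in combinations(sorted(active), 2):
            counts[pair] += 1
    return counts


def cluster_fus(profile, n_fus, n_clusters):
    """Greedily merge the two clusters whose members are most often co-active
    until only n_clusters remain (hypothetical stand-in, not the paper's scheme)."""
    pair_counts = coactivity(profile, n_fus)
    clusters = [{fu} for fu in range(n_fus)]

    def affinity(a, b):
        # Total co-activity between every FU in cluster a and every FU in cluster b.
        return sum(pair_counts[(min(x, y), max(x, y))] for x in a for y in b)

    while len(clusters) > n_clusters:
        # combinations() yields i < j, so deleting index j leaves index i valid.
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda ij: affinity(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] |= clusters[j]
        del clusters[j]
    return clusters


def slot_reads(profile, clusters):
    """Buffer slots read per loop iteration, assuming a cluster's buffer is
    accessed only in cycles where at least one of its FUs is active."""
    return sum(len(c) for active in profile for c in clusters if c & active)


if __name__ == "__main__":
    # Toy profile: 4 FUs, 5-cycle loop body; each set lists the active FUs.
    profile = [{0, 1}, {0, 1}, {2, 3}, {2, 3}, {0, 1}]
    clusters = cluster_fus(profile, n_fus=4, n_clusters=2)
    print(sorted(sorted(c) for c in clusters))       # [[0, 1], [2, 3]]
    print(slot_reads(profile, [set(range(4))]))      # centralized: 20
    print(slot_reads(profile, clusters))             # profile-driven: 10
    print(slot_reads(profile, [{0, 2}, {1, 3}]))     # interleaved: 20
```

In this toy profile, grouping FUs that fire together halves the slot reads relative to a single centralized buffer, whereas the interleaved grouping `[{0, 2}, {1, 3}]` saves nothing. This only qualitatively mirrors why a profile-driven grouping can outperform a random one.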