Clustered Loop Buffer Organization for Low Energy VLIW Embedded Processors

Authors:
Murali Jayapala;Francisco Barat;Tom Vander Aa;Francky Catthoor;Henk Corporaal;Geert Deconinck
Affiliations:
IEEE;IEEE;IEEE;IEEE;-;IEEE
Venue:
IEEE Transactions on Computers
Year:
2005


Abstract

Current loop buffer organizations for very long instruction word (VLIW) processors are essentially centralized. As a consequence, they are energy inefficient and their scalability is limited. To alleviate this problem, we propose a clustered loop buffer organization, in which the loop buffers are partitioned and the functional units are logically grouped to form clusters, along with two buffer control schemes that regulate the activity in each cluster. Furthermore, we propose a design-time scheme that generates clusters by analyzing an application profile and grouping closely related functional units. The simulation results indicate that the energy consumed in the clustered loop buffers is, on average, 63 percent lower than in an uncompressed centralized loop buffer scheme, 35 percent lower than in a centralized compressed loop buffer scheme, and 22 percent lower than in a randomly clustered loop buffer scheme.
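The abstract does not spell out how the design-time clustering works, so the following is only a minimal sketch of the general idea of grouping closely related functional units from an application profile. It assumes the profile is a per-cycle list of active functional units, measures relatedness as the Jaccard similarity of activation cycles, and merges groups greedily by average similarity. The names `coactivity` and `cluster_units`, the profile format, and the similarity measure are all hypothetical choices for illustration, not the authors' algorithm.

```python
from itertools import combinations


def coactivity(trace, a, b):
    """Jaccard similarity of the cycles in which units a and b are active.

    Assumption: `trace` is a list with one set of active unit names per cycle.
    """
    cycles_a = {i for i, active in enumerate(trace) if a in active}
    cycles_b = {i for i, active in enumerate(trace) if b in active}
    union = cycles_a | cycles_b
    return len(cycles_a & cycles_b) / len(union) if union else 0.0


def cluster_units(trace, units, num_clusters):
    """Greedy agglomerative grouping (illustrative, not the paper's scheme).

    Repeatedly merges the two clusters with the highest average pairwise
    co-activity until only `num_clusters` clusters remain.
    """
    clusters = [[u] for u in units]
    sim = {
        frozenset((a, b)): coactivity(trace, a, b)
        for a, b in combinations(units, 2)
    }

    def linkage(c1, c2):
        # Average similarity over all cross-cluster pairs of units.
        total = sum(sim[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > num_clusters:
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: linkage(clusters[p[0]], clusters[p[1]]),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return [sorted(c) for c in clusters]


# Hypothetical five-cycle profile of a four-unit VLIW datapath.
profile = [
    {"alu0", "alu1"},
    {"alu0", "alu1"},
    {"mul0", "ls0"},
    {"mul0", "ls0"},
    {"alu0", "alu1", "mul0"},
]
groups = cluster_units(profile, ["alu0", "alu1", "mul0", "ls0"], 2)
```

In this toy profile the two ALUs are always active together, so they end up in one cluster and the multiplier and load/store unit in the other. The intuition is that a loop buffer partition serving units that are active in the same cycles can stay idle whenever its whole cluster is idle.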