Distributed loop controller architecture for multi-threading in uni-threaded VLIW processors

Authors:
Praveen Raghavan;Andy Lambrechts;Murali Jayapala;Francky Catthoor;Diederik Verkest
Affiliations:
IMEC vzw, Kapeldreef, Leuven, Belgium;IMEC vzw, Kapeldreef, Leuven, Belgium;IMEC vzw, Kapeldreef, Leuven, Belgium;IMEC vzw, Kapeldreef, Leuven, Belgium;IMEC vzw, Kapeldreef, Leuven, Belgium
Venue:
Proceedings of the conference on Design, automation and test in Europe
Year:
2006

Abstract

Reducing energy consumption is one of the most important design goals in embedded application domains such as wireless, multimedia and biomedical. The instruction memory hierarchy has been shown to be one of the most power-hungry parts of such systems. This paper introduces an architectural enhancement to the instruction memory that reduces energy consumption and improves performance. The proposed distributed instruction memory organization requires minimal hardware overhead and allows multiple loops to execute in parallel in a uni-processor system. The enhancement can reduce the energy consumed in the instruction and data memory hierarchy by 70.01% and improve performance by 32.89% compared to enhanced SMT-based architectures.