A loop accelerator for low power embedded VLIW processors

Authors:
Binu Mathew;Al Davis
Affiliations:
University of Utah, Salt Lake City, UT;University of Utah, Salt Lake City, UT
Venue:
Proceedings of the 2nd IEEE/ACM/IFIP international conference on Hardware/software codesign and system synthesis
Year:
2004


Abstract

The high transistor density afforded by modern VLSI processes has enabled the design of embedded processors that use clustered execution units to deliver high levels of performance. However, delivering data to the execution resources in a timely manner remains a major problem that limits ILP. It is particularly significant for embedded systems, where memory and power budgets are limited. A distributed address generation and loop acceleration architecture for VLIW processors is presented. This decentralized on-chip memory architecture uses multiple SRAMs to provide high intra-processor bandwidth. Each SRAM has an associated stream address generator capable of implementing a variety of addressing modes in conjunction with a shared loop accelerator. The architecture is extremely useful for generating application-specific embedded processors, particularly for processing input data that is organized as a stream. The idea is evaluated in the context of a fine-grain VLIW architecture executing complex perception algorithms such as speech and visual feature recognition. Transistor-level SPICE simulations are used to demonstrate a 159x improvement in the energy-delay product when compared to conventional architectures executing the same applications.
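
Illustrative sketch (not taken from the paper): the abstract does not say which addressing modes the stream address generators support. One common mode for streamed array data is an affine (strided) access pattern, in which each address is a base plus a weighted sum of the current loop counters. The Python model below shows how a per-SRAM generator could derive its address stream from counters supplied by a shared loop unit. The function name `stream_addresses` and its parameters `base`, `loop_counts`, and `strides` are hypothetical and chosen only for this example.

```python
from itertools import product


def stream_addresses(base, loop_counts, strides):
    """Yield base + sum(index_k * stride_k) for each iteration of a loop nest.

    Hypothetical software model of strided stream addressing; this is an
    assumption for illustration, not the hardware described in the paper.
    loop_counts gives the trip count of each loop, outermost first, and
    strides gives the address step applied per loop level.
    """
    if len(loop_counts) != len(strides):
        raise ValueError("loop_counts and strides must have the same length")
    # product() advances the last (innermost) index fastest, like a loop nest.
    for indices in product(*(range(n) for n in loop_counts)):
        yield base + sum(i * s for i, s in zip(indices, strides))


# Walk a 2 x 3 tile stored row-major with a row pitch of 8 words at base 100.
tile = list(stream_addresses(100, (2, 3), (8, 1)))
```

In this model the loop counters are produced in one place and every generator applies its own base and strides, which loosely mirrors the abstract's pairing of per-SRAM address generators with a shared loop accelerator.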