Proceedings of the twenty-second annual ACM symposium on Parallelism in algorithms and architectures
Understanding sources of inefficiency in general-purpose chips
Proceedings of the 37th annual international symposium on Computer architecture
Fine-grain dynamic instruction placement for L0 scratch-pad memory
CASES '10 Proceedings of the 2010 international conference on Compilers, architectures and synthesis for embedded systems
Understanding sources of ineffciency in general-purpose chips
Communications of the ACM
NANOARCH '11 Proceedings of the 2011 IEEE/ACM International Symposium on Nanoscale Architectures
Energy efficient special instruction support in an embedded processor with compact isa
Proceedings of the 2012 international conference on Compilers, architectures and synthesis for embedded systems
Convolution engine: balancing efficiency & flexibility in specialized computing
Proceedings of the 40th Annual International Symposium on Computer Architecture
Systematic evaluation of workload clustering for extremely energy-efficient architectures
ACM SIGARCH Computer Architecture News
Scheduling for register file energy minimization in explicit datapath architectures
DATE '12 Proceedings of the Conference on Design, Automation and Test in Europe
ACM Transactions on Architecture and Code Optimization (TACO)
Energy efficient computation: A silicon perspective
Integration, the VLSI Journal
Hi-index | 0.02 |
We present an efficient programmable architecture for compute-intensive embedded applications. The processor architecture uses instruction registers to reduce the cost of delivering instructions, and a hierarchical and distributed data register organization to deliver data. Instruction registers capture instruction reuse and locality in inexpensive storage structures that are located near to the functional units. The data register organization captures reuse and locality in different levels of the hierarchy to reduce the cost of delivering data. Exposed communication resources eliminate pipeline registers and control logic, and allow the compiler to schedule efficient instruction and data movement. The architecture keeps a significant fraction of instruction and data bandwidth local to the functional units, which reduces the cost of supplying instructions and data to large numbers of functional units. This architecture achieves an energy efficiency that is 23脳 greater than an embedded RISC processor.