The superblock: an effective technique for VLIW and superscalar compilation
The Journal of Supercomputing - Special issue on instruction-level parallelism
A high-performance microarchitecture with hardware-programmable functional units
MICRO 27 Proceedings of the 27th annual international symposium on Microarchitecture
Putting the fill unit to work: dynamic optimizations for trace cache microprocessors
MICRO 31 Proceedings of the 31st annual ACM/IEEE international symposium on Microarchitecture
CHIMAERA: a high-performance architecture with a tightly-coupled reconfigurable functional unit
Proceedings of the 27th annual international symposium on Computer architecture
rePLay: A Hardware Framework for Dynamic Optimization
IEEE Transactions on Computers
PICO-NPA: High-Level Synthesis of Nonprogrammable Hardware Accelerators
Journal of VLSI Signal Processing Systems
Garp: a MIPS processor with a reconfigurable coprocessor
FCCM '97 Proceedings of the 5th IEEE Symposium on FPGA-Based Custom Computing Machines
Exploiting ILP, TLP, and DLP with the polymorphous TRIPS architecture
Proceedings of the 30th annual international symposium on Computer architecture
Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
ASPLOS XI Proceedings of the 11th international conference on Architectural support for programming languages and operating systems
Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture
Exploring the design space of LUT-based transparent accelerators
Proceedings of the 2005 international conference on Compilers, architectures and synthesis for embedded systems
The H.264 Video Coding Standard
IEEE MultiMedia
Efficient high-performance ASIC implementation of JPEG-LS encoder
Proceedings of the conference on Design, automation and test in Europe
Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0
Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture
VEAL: Virtualized Execution Accelerator for Loops
ISCA '08 Proceedings of the 35th Annual International Symposium on Computer Architecture
Achieving Out-of-Order Performance with Almost In-Order Complexity
ISCA '08 Proceedings of the 35th Annual International Symposium on Computer Architecture
Computer
From SODA to scotch: The evolution of a wireless baseband processor
Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture
Conservation cores: reducing the energy of mature computations
Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems
Understanding sources of inefficiency in general-purpose chips
Proceedings of the 37th annual international symposium on Computer architecture
Dynamically Specialized Datapaths for energy efficient computing
HPCA '11 Proceedings of the 2011 IEEE 17th International Symposium on High Performance Computer Architecture
LEAP: latency- energy- and area-optimized lookup pipeline
Proceedings of the eighth ACM/IEEE symposium on Architectures for networking and communications systems
Neural Acceleration for General-Purpose Approximate Programs
MICRO-45 Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture
A general constraint-centric scheduling framework for spatial architectures
Proceedings of the 34th ACM SIGPLAN conference on Programming language design and implementation
Systematic evaluation of workload clustering for extremely energy-efficient architectures
ACM SIGARCH Computer Architecture News
Toward application-specific memory reconfiguration for energy efficiency
E2SC '13 Proceedings of the 1st International Workshop on Energy Efficient Supercomputing
SWSL: software synthesis for network lookup
ANCS '13 Proceedings of the ninth ACM/IEEE symposium on Architectures for networking and communications systems
Meet the walkers: accelerating index traversals for in-memory databases
Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture
Evaluator-executor transformation for efficient pipelining of loops with conditionals
ACM Transactions on Architecture and Code Optimization (TACO)
Hi-index | 0.00 |
Technology scaling has delivered on its promises of increasing device density on a single chip. However, the voltage scaling trend has failed to keep up, introducing tight power constraints on manufactured parts. In such a scenario, there is a need to incorporate energy-efficient processing resources that can enable more computation within the same power budget. Energy efficiency solutions in the past have typically relied on application specific hardware and accelerators. Unfortunately, these approaches do not extend to general purpose applications due to their irregular and diverse code base. Towards this end, we propose BERET, an energy-efficient co-processor that can be configured to benefit a wide range of applications. Our approach identifies recurring instruction sequences as phases of "temporal regularity" in a program's execution, and maps suitable ones to the BERET hardware, a three-stage pipeline with a bundled execution model. This judicious off-loading of program execution to a reduced-complexity hardware demonstrates significant savings on instruction fetch, decode and register file accesses energy. On average, BERET reduces energy consumption by a factor of 3-4X for the program regions selected across a range of general-purpose and media applications. The average energy savings for the entire application run was 35% over a single-issue in-order processor.