Energy-efficient computation is critical if we are going to continue to scale performance in power-limited systems. For floating-point applications that have large amounts of data parallelism, one should optimize the throughput/mm² given a power density constraint. We present a method for creating a trade-off curve that can be used to estimate the maximum floating-point performance given a set of area and power constraints. Looking at FP multiply-add units and ignoring register and memory overheads, we find that in a 90 nm CMOS technology at 1 W/mm², one can achieve a performance of 27 GFlops/mm² single precision and 7.5 GFlops/mm² double precision. Adding register file overheads reduces the throughput by less than 50 percent if the compute intensity is high. Since the energy of the basic gates is no longer scaling rapidly, maintaining constant power density with scaling requires moving the overall FP architecture to a lower energy/performance point. A 1 W/mm² design at 90 nm is a “high-energy” design, so scaling it to a lower-energy design in 45 nm still yields a 7× performance gain, while a more balanced 0.1 W/mm² design only speeds up by 3.5× when scaled to 45 nm. Performance scaling below 45 nm rapidly decreases, with a projected improvement of only ~3× for both power densities when scaling to a 22 nm technology.
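The 90 nm figures quoted above imply an energy cost per floating-point operation, since power density divided by throughput density gives joules per flop. The following is a minimal sketch of that arithmetic only, not the paper's trade-off method; the function name and the 50 percent register-file bound applied to the single-precision figure are illustrative assumptions.

```python
def energy_per_flop_pj(gflops_per_mm2: float, watts_per_mm2: float) -> float:
    """Energy per floating-point operation, in picojoules.

    (W/mm^2) / (flops/s/mm^2) = J/flop; the area terms cancel.
    """
    flops_per_second_per_mm2 = gflops_per_mm2 * 1e9
    joules_per_flop = watts_per_mm2 / flops_per_second_per_mm2
    return joules_per_flop * 1e12


# Figures quoted in the abstract: 90 nm CMOS at 1 W/mm^2, FP units only.
single_pj = energy_per_flop_pj(27.0, 1.0)
double_pj = energy_per_flop_pj(7.5, 1.0)

# Assumption for illustration: "less than 50 percent" reduction from
# register file overheads is applied as a lower bound on the
# single-precision throughput density.
single_floor_gflops = 27.0 * 0.5

print(f"single precision: {single_pj:.1f} pJ/flop")
print(f"double precision: {double_pj:.1f} pJ/flop")
print(f"single precision with register files: > {single_floor_gflops} GFlops/mm^2")
```

Under these numbers, double precision costs roughly 3.6 times the energy per operation of single precision at the same power density, which is simply the ratio of the two throughput densities.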