Understanding sources of inefficiency in general-purpose chips

Authors:
Rehan Hameed;Wajahat Qadeer;Megan Wachs;Omid Azizi;Alex Solomatnikov;Benjamin C. Lee;Stephen Richardson;Christos Kozyrakis;Mark Horowitz
Affiliations:
Stanford University, Stanford, CA;Stanford University, Stanford, CA;Stanford University, Stanford, CA;Hicamp Systems, Menlo Park, CA;Hicamp Systems, Menlo Park, CA;Duke University, Durham, NC;Stanford University, Stanford, CA;Stanford University, Stanford, CA;Stanford University, Stanford, CA
Venue:
Communications of the ACM
Year:
2011

Citing 10
Cited 2

Application-specific instruction generation for configurable processor architectures

FPGA '04 Proceedings of the 2004 ACM/SIGDA 12th international symposium on Field programmable gate arrays
Flexible architectures for engineering successful SOCs

Proceedings of the 41st annual Design Automation Conference
Automated Custom Instruction Generation for Domain-Specific Processor Acceleration

IEEE Transactions on Computers
Scaling, Power and the Future of CMOS

VLSID '07 Proceedings of the 20th International Conference on VLSI Design held jointly with 6th International Conference: Embedded Systems
Optimization of sparse matrix-vector multiplication on emerging multicore platforms

Proceedings of the 2007 ACM/IEEE conference on Supercomputing
An Energy-Efficient Processor Architecture for Embedded Systems

IEEE Computer Architecture Letters
AnySP: anytime anywhere anyway signal processing

Proceedings of the 36th annual international symposium on Computer architecture
Rethinking Digital Design: Why Design Must Change

IEEE Micro
Dark silicon and the end of multicore scaling

Proceedings of the 38th annual international symposium on Computer architecture
Analysis and architecture design of an HDTV720p 30 frames/s H.264/AVC encoder

IEEE Transactions on Circuits and Systems for Video Technology

Run-time adaption for highly-complex multi-core systems

Proceedings of the Ninth IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis
Instruction set extensions for dynamic time warping

Proceedings of the Ninth IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis

Quantified Score

Hi-index	48.22

Abstract

Scaling the performance of a power limited processor requires decreasing the energy expended per instruction executed, since energy/op * op/second is power. To better understand what improvement in processor efficiency is possible, and what must be done to capture it, we quantify the sources of the performance and energy overheads of a 720p HD H.264 encoder running on a general-purpose four-processor CMP system. The initial overheads are large: the CMP was 500x less energy efficient than an Application Specific Integrated Circuit (ASIC) doing the same job. We explore methods to eliminate these overheads by transforming the CPU into a specialized system for H.264 encoding. Broadly applicable optimizations like single instruction, multiple data (SIMD) units improve CMP performance by 14x and energy by 10x, which is still 50x worse than an ASIC. The problem is that the basic operation costs in H.264 are so small that even with a SIMD unit doing over 10 ops per cycle, 90% of the energy is still overhead. Achieving ASIC-like performance and efficiency requires algorithm-specific optimizations. For each subalgorithm of H.264, we create a large, specialized functional/storage unit capable of executing hundreds of operations per instruction. This improves energy efficiency by 160x (instead of 10x), and the final customized CMP reaches the same performance and within 3x of an ASIC solution's energy in comparable area.
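
The energy ratios quoted in the abstract can be checked with a few lines of arithmetic. The Python sketch below is only an illustration and is not taken from the paper: the function names and the 100 pJ/op, 1 Gop/s example values are assumptions, while the 500x, 10x and 160x ratios are the ones the abstract states.

```python
# Ratios quoted in the abstract (relative figures, not absolute energies).
BASELINE_GAP_VS_ASIC = 500.0  # baseline CMP uses 500x the ASIC's energy
SIMD_ENERGY_GAIN = 10.0       # energy improvement from SIMD-style optimizations
CUSTOM_ENERGY_GAIN = 160.0    # energy improvement from algorithm-specific units


def power_watts(energy_per_op_joules, ops_per_second):
    """Power as energy/op * op/second, the identity the abstract starts from."""
    return energy_per_op_joules * ops_per_second


def remaining_gap(baseline_gap, energy_gain):
    """Energy gap to the ASIC that is left after an improvement of energy_gain."""
    return baseline_gap / energy_gain


# Gap left after each optimization stage, relative to the ASIC.
simd_gap = remaining_gap(BASELINE_GAP_VS_ASIC, SIMD_ENERGY_GAIN)
custom_gap = remaining_gap(BASELINE_GAP_VS_ASIC, CUSTOM_ENERGY_GAIN)

# Hypothetical operating point (assumed values, not from the paper):
# 100 pJ per op at 1e9 ops per second.
example_power = power_watts(100e-12, 1e9)

print(f"gap after SIMD: {simd_gap:.0f}x")
print(f"gap after custom units: {custom_gap:.3f}x")
print(f"example power: {example_power:.3f} W")
```

Under these assumptions, 500/10 reproduces the 50x gap the abstract reports after SIMD, and 500/160 is about 3.1, consistent with the final design being within roughly 3x of the ASIC's energy.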