Understanding sources of inefficiency in general-purpose chips

Authors:
Rehan Hameed;Wajahat Qadeer;Megan Wachs;Omid Azizi;Alex Solomatnikov;Benjamin C. Lee;Stephen Richardson;Christos Kozyrakis;Mark Horowitz
Affiliations:
Stanford University, Stanford, CA, USA;Stanford University, Stanford, CA, USA;Stanford University, Stanford, CA, USA;Stanford University, Stanford, CA, USA;Hicamp Systems, Menlo Park, CA, USA;Stanford University, Stanford, CA, USA;Stanford University, Stanford, CA, USA;Stanford University, Stanford, CA, USA;Stanford University, Stanford, CA, USA
Venue:
Proceedings of the 37th annual international symposium on Computer architecture
Year:
2010

Citing 18
Cited 39

Application-specific instruction generation for configurable processor architectures

FPGA '04 Proceedings of the 2004 ACM/SIGDA 12th international symposium on Field programmable gate arrays
Flexible architectures for engineering successful SOCs

Proceedings of the 41st annual Design Automation Conference
Automated Custom Instruction Generation for Domain-Specific Processor Acceleration

IEEE Transactions on Computers
Scientific applications vs. SPEC-FP: a comparison of program behavior

Proceedings of the 20th annual international conference on Supercomputing
Customizable Embedded Processors: Design Technologies and Applications

Customizable Embedded Processors: Design Technologies and Applications
A VLSI architecture design of an edge based fast intra prediction mode decision algorithm for h.264/avc

Proceedings of the 17th ACM Great Lakes symposium on VLSI
Scaling, Power and the Future of CMOS

VLSID '07 Proceedings of the 20th International Conference on VLSI Design held jointly with 6th International Conference: Embedded Systems
Characteristics of workloads used in high performance and technical computing

Proceedings of the 21st annual international conference on Supercomputing
Chip multi-processor generator

Proceedings of the 44th annual Design Automation Conference
Optimization of sparse matrix-vector multiplication on emerging multicore platforms

Proceedings of the 2007 ACM/IEEE conference on Supercomputing
An Energy-Efficient Processor Architecture for Embedded Systems

IEEE Computer Architecture Letters
A 242mW, 10mm² 1080p H.264/AVC high profile encoder chip

Proceedings of the 45th annual Design Automation Conference
AnySP: anytime anywhere anyway signal processing

Proceedings of the 36th annual international symposium on Computer architecture
A memory system design framework: creating smart memories

Proceedings of the 36th annual international symposium on Computer architecture
Using a configurable processor generator for computer architecture prototyping

Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
Overview of the H.264/AVC video coding standard

IEEE Transactions on Circuits and Systems for Video Technology
Analysis and architecture design of an HDTV720p 30 frames/s H.264/AVC encoder

IEEE Transactions on Circuits and Systems for Video Technology
High-Throughput Architecture for H.264/AVC CABAC Compression System

IEEE Transactions on Circuits and Systems for Video Technology

Hardware implementation of micropolygon rasterization with motion and defocus blur

Proceedings of the Conference on High Performance Graphics
The future of microprocessors

Communications of the ACM
A novel thread scheduler design for polymorphic embedded systems

CASES '11 Proceedings of the 14th international conference on Compilers, architectures and synthesis for embedded systems
An energy-efficient patchable accelerator for post-silicon engineering changes

CODES+ISSS '11 Proceedings of the seventh IEEE/ACM/IFIP international conference on Hardware/software codesign and system synthesis
Liszt: a domain specific language for building portable mesh-based PDE solvers

Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
The accelerator store: A shared memory framework for accelerator-based systems

ACM Transactions on Architecture and Code Optimization (TACO) - HIPEAC Papers
Randomized accuracy-aware program transformations for efficient approximate computations

POPL '12 Proceedings of the 39th annual ACM SIGPLAN-SIGACT symposium on Principles of programming languages
Compiling high throughput network processors

Proceedings of the ACM/SIGDA international symposium on Field Programmable Gate Arrays
Post-silicon debugging targeting electrical errors with patchable controllers (abstract only)

Proceedings of the ACM/SIGDA international symposium on Field Programmable Gate Arrays
Clearing the clouds: a study of emerging scale-out workloads on modern hardware

ASPLOS XVII Proceedings of the seventeenth international conference on Architectural Support for Programming Languages and Operating Systems
Bundled execution of recurring traces for energy-efficient general purpose processing

Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
Idempotent processor architecture

Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
QsCores: trading dark silicon for scalable energy efficiency with quasi-specific cores

Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
VISION: cloud-powered sight for all: showing the cloud what you see

Proceedings of the third ACM workshop on Mobile cloud computing and services
A defect-tolerant accelerator for emerging high-performance applications

Proceedings of the 39th Annual International Symposium on Computer Architecture
OpenRadio: a programmable wireless dataplane

Proceedings of the first workshop on Hot topics in software defined networks
Operating systems should manage accelerators

HotPar'12 Proceedings of the 4th USENIX conference on Hot Topics in Parallelism
A case of system-level hardware/software co-design and co-verification of a commodity multi-processor system with custom hardware

Proceedings of the eighth IEEE/ACM/IFIP international conference on Hardware/software codesign and system synthesis
Quantifying the Mismatch between Emerging Scale-Out Applications and Modern Processors

ACM Transactions on Computer Systems (TOCS)
LEAP: latency- energy- and area-optimized lookup pipeline

Proceedings of the eighth ACM/IEEE symposium on Architectures for networking and communications systems
Power challenges may end the multicore era

Communications of the ACM
Homogeneous and heterogeneous MPSoC architectures with network-on-chip connectivity for low-power and real-time multimedia signal processing

VLSI Design
Towards a performance- and energy-efficient data filter cache

Proceedings of the 10th Workshop on Optimizations for DSP and Embedded Systems
Neural Acceleration for General-Purpose Approximate Programs

MICRO-45 Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture
Continuous real-world inputs can open up alternative accelerator designs

Proceedings of the 40th Annual International Symposium on Computer Architecture
Convolution engine: balancing efficiency & flexibility in specialized computing

Proceedings of the 40th Annual International Symposium on Computer Architecture
Systematic evaluation of workload clustering for extremely energy-efficient architectures

ACM SIGARCH Computer Architecture News
SGRT: a mobile GPU architecture for real-time ray tracing

Proceedings of the 5th High-Performance Graphics Conference
APE: accelerator processor extensions to optimize data-compute co-location

Proceedings of the ACM SIGPLAN Workshop on Memory Systems Performance and Correctness
Energy-efficient branch prediction with compiler-guided history stack

DATE '12 Proceedings of the Conference on Design, Automation and Test in Europe
Meet the walkers: accelerating index traversals for in-memory databases

Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture
Q100: the architecture and design of a database processing unit

Proceedings of the 19th international conference on Architectural support for programming languages and operating systems
DianNao: a small-footprint high-throughput accelerator for ubiquitous machine-learning

Proceedings of the 19th international conference on Architectural support for programming languages and operating systems
OmpSs@Zynq all-programmable SoC ecosystem

Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays
Selecting representative benchmark inputs for exploring microprocessor design spaces

ACM Transactions on Architecture and Code Optimization (TACO)
Accelerating an application domain with specialized functional units

ACM Transactions on Architecture and Code Optimization (TACO)
Designing a practical data filter cache to improve both energy efficiency and performance

ACM Transactions on Architecture and Code Optimization (TACO)
Automated design of networks of transport-triggered architecture processors using dynamic dataflow programs

Image Communication
Optimization of interconnects between accelerators and shared memories in dark silicon

Proceedings of the International Conference on Computer-Aided Design

Abstract

Due to their high volume, general-purpose processors, and now chip multiprocessors (CMPs), are much more cost-effective than ASICs, but lag significantly in terms of performance and energy efficiency. This paper explores the sources of these performance and energy overheads in general-purpose processing systems by quantifying the overheads of a 720p HD H.264 encoder running on a general-purpose CMP system. It then explores methods to eliminate these overheads by transforming the CPU into a specialized system for H.264 encoding. We evaluate the gains from customizations useful to broad classes of algorithms, such as SIMD units, as well as those specific to a particular computation, such as customized storage and functional units. The ASIC is 500x more energy efficient than our original four-processor CMP. Broadly applicable optimizations improve performance by 10x and energy by 7x. However, the very low energy costs of actual core operations (hundreds of fJ in 90nm) mean that over 90% of the energy used in these solutions is still "overhead". Achieving ASIC-like performance and efficiency requires algorithm-specific optimizations. For each sub-algorithm of H.264, we create a large, specialized functional unit that is capable of executing hundreds of operations per instruction. This improves performance and energy by an additional 25x, and the final customized CMP matches an ASIC solution's performance within 3x of its energy and within comparable area.
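
The figures quoted in the abstract can be checked against each other with a few lines of arithmetic. The sketch below is illustrative only: it assumes the broadly applicable gains and the algorithm-specific gains compose multiplicatively, and that the "additional 25x" applies equally to performance and energy, as the abstract's wording suggests. The variable and function names are hypothetical and do not come from the paper.

```python
# Gains quoted in the abstract, all relative to the original four-processor CMP.
ASIC_ENERGY_GAIN = 500.0   # ASIC energy efficiency vs. the baseline CMP
BROAD_PERF_GAIN = 10.0     # broadly applicable optimizations, performance
BROAD_ENERGY_GAIN = 7.0    # broadly applicable optimizations, energy
SPECIFIC_GAIN = 25.0       # algorithm-specific units, performance and energy


def cumulative_gain(*stage_gains):
    """Multiply per-stage gains (assumption: stages compose multiplicatively)."""
    total = 1.0
    for gain in stage_gains:
        total *= gain
    return total


energy_gain = cumulative_gain(BROAD_ENERGY_GAIN, SPECIFIC_GAIN)
perf_gain = cumulative_gain(BROAD_PERF_GAIN, SPECIFIC_GAIN)

# Remaining energy gap between the customized CMP and the ASIC.
energy_gap_to_asic = ASIC_ENERGY_GAIN / energy_gain

print(f"cumulative energy gain:      {energy_gain:.0f}x")
print(f"cumulative performance gain: {perf_gain:.0f}x")
print(f"energy gap to ASIC:          {energy_gap_to_asic:.2f}x")
```

Under these assumptions the customized CMP ends up roughly 175x more energy efficient than the baseline, which leaves a gap of a little under 3x to the ASIC's 500x. That is consistent with the abstract's claim of matching ASIC performance "within 3x of its energy".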