Because of their high volume, general-purpose processors, and now chip multiprocessors (CMPs), are much more cost-effective than ASICs, but they lag significantly in performance and energy efficiency. This paper explores the sources of these performance and energy overheads in general-purpose processing systems by quantifying them for a 720p HD H.264 encoder running on a general-purpose CMP system. It then explores methods to eliminate these overheads by transforming the CPU into a specialized system for H.264 encoding. We evaluate the gains from customizations useful to broad classes of algorithms, such as SIMD units, as well as those specific to a particular computation, such as customized storage and functional units. The ASIC is 500x more energy efficient than our original four-processor CMP. Broadly applicable optimizations improve performance by 10x and energy by 7x. However, the very low energy cost of the actual core operations (hundreds of fJ in 90nm) means that over 90% of the energy used in these solutions is still "overhead". Achieving ASIC-like performance and efficiency requires algorithm-specific optimizations. For each sub-algorithm of H.264, we create a large, specialized functional unit capable of executing hundreds of operations per instruction. This improves performance and energy by an additional 25x, and the final customized CMP matches an ASIC solution's performance within 3x of its energy and within comparable area.
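The figures quoted above can be cross-checked with a little arithmetic. The following is a minimal sketch, not the authors' methodology: it assumes the broad (10x performance, 7x energy) and algorithm-specific (25x) gains compose multiplicatively, and that every factor is measured against the original four-processor CMP. The constant and function names are hypothetical, introduced only for illustration.

```python
# Rounded factors as quoted in the abstract. All names here are hypothetical.
ASIC_ENERGY_ADVANTAGE = 500.0  # ASIC vs. original four-processor CMP (energy)
BROAD_PERF_GAIN = 10.0         # broadly applicable optimizations, performance
BROAD_ENERGY_GAIN = 7.0        # broadly applicable optimizations, energy
SPECIFIC_GAIN = 25.0           # additional algorithm-specific gain (both metrics)


def cumulative_gains():
    """Return (performance, energy) gains, assuming the stages multiply."""
    perf = BROAD_PERF_GAIN * SPECIFIC_GAIN
    energy = BROAD_ENERGY_GAIN * SPECIFIC_GAIN
    return perf, energy


def remaining_energy_gap():
    """Energy factor still separating the customized CMP from the ASIC."""
    _, energy = cumulative_gains()
    return ASIC_ENERGY_ADVANTAGE / energy


if __name__ == "__main__":
    perf, energy = cumulative_gains()
    print(f"cumulative performance gain: {perf:.0f}x")
    print(f"cumulative energy gain:      {energy:.0f}x")
    print(f"remaining energy gap to ASIC: {remaining_energy_gap():.2f}x")
```

Under these assumptions the customized CMP is about 175x more energy efficient than the baseline, which leaves a gap of roughly 2.9x to the ASIC. That is consistent with the "within 3x" claim.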