Systematic evaluation of workload clustering for extremely energy-efficient architectures

Authors:
Apala Guha;Yao Zhang;Raihan ur Rasool;Andrew A. Chien
Affiliations:
University of Chicago, Chicago, Illinois;University of Chicago, Chicago, Illinois;University of Chicago, Chicago, Illinois;University of Chicago, Chicago, Illinois
Venue:
ACM SIGARCH Computer Architecture News
Year:
2013

Abstract

Chip power consumption has reached its limits, leading to the flattening of single-core performance. We propose the 10x10 processor, a federated heterogeneous multi-core architecture in which each core is an ensemble of u-engines (micro-engines, similar to accelerators) specialized for different workload groups to achieve dramatically higher energy efficiency. The u-engines collectively target the entire general-purpose workload space. The problem we study in this article is selecting the set of workloads for which each u-engine should be customized. To that end, we study the computation structure of a wide variety of workloads and cluster together workloads with similar computation structures, the idea being that each u-engine will be customized for the compute structures exhibited by a particular cluster. The constraint on this problem is the silicon budget of a processor. Lower silicon budgets accommodate fewer u-engines and require individual u-engines to target larger segments of the workload space, which leads to lower energy-efficiency benefits from customization because there is more variation among the compute structures making up each cluster. Therefore, we also study how workload coverage and benefit can be maximized for a given silicon budget. We study a broad general-purpose workload that includes 34 codes from 6 benchmark suites, identifying the most frequent functions and clustering them based on two sets of instruction usage features (high-resolution and low-resolution) into 8, 16, 32, 64, and 128 clusters. We develop abstract metrics (coverage and weighted customization benefit) to evaluate the clusters. We show significant potential payoffs with four benefit models: 2-3x (square root model), 4-10x (linear model), 12-24x (quadratic model), and 22-26x (cubic model).
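
The clustering and budget trade-off described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the authors' method: the function names, the four-element instruction-mix vectors, and the runtime shares are invented toy data; plain k-means with farthest-point seeding stands in for whatever clustering procedure the paper uses; `coverage` is assumed to mean the share of runtime that falls in clusters given a u-engine; and `spread` is a hypothetical proxy for the variation within a cluster. The four benefit models are not sketched, since the abstract does not define them.

```python
from math import dist

# Hypothetical toy data: (function name, instruction-mix vector, share of total runtime).
# Assumed feature order: [integer ALU, floating point, memory, branch] fractions.
FUNCTIONS = [
    ("fft_butterfly", (0.10, 0.60, 0.25, 0.05), 0.30),
    ("matmul_inner",  (0.12, 0.58, 0.26, 0.04), 0.25),
    ("str_compare",   (0.50, 0.00, 0.30, 0.20), 0.15),
    ("hash_lookup",   (0.48, 0.02, 0.32, 0.18), 0.10),
    ("list_walk",     (0.20, 0.00, 0.60, 0.20), 0.12),
    ("tree_search",   (0.22, 0.00, 0.56, 0.22), 0.08),
]

def farthest_point_seeds(points, k):
    """Deterministic seeding: start at the first point, then repeatedly add
    the point farthest from all seeds chosen so far."""
    seeds = [points[0]]
    while len(seeds) < k:
        seeds.append(max(points, key=lambda p: min(dist(p, s) for s in seeds)))
    return seeds

def kmeans(points, k, iterations=20):
    """Plain Lloyd's k-means; returns (labels, centroids)."""
    centroids = farthest_point_seeds(points, k)
    labels = []
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        labels = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centroids

def coverage(labels, weights, budget):
    """Assumed definition: runtime share of the `budget` heaviest clusters,
    i.e. the clusters that would receive a u-engine."""
    per_cluster = {}
    for label, w in zip(labels, weights):
        per_cluster[label] = per_cluster.get(label, 0.0) + w
    return sum(sorted(per_cluster.values(), reverse=True)[:budget])

def spread(points, weights, labels, centroids):
    """Hypothetical variation proxy: runtime-weighted mean distance to centroid."""
    total = sum(weights)
    return sum(w * dist(p, centroids[lab])
               for p, w, lab in zip(points, weights, labels)) / total

points = [f[1] for f in FUNCTIONS]
weights = [f[2] for f in FUNCTIONS]

labels3, centroids3 = kmeans(points, k=3)
labels2, centroids2 = kmeans(points, k=2)

for budget in (1, 2, 3):
    print("u-engines:", budget, "coverage:", round(coverage(labels3, weights, budget), 2))
print("spread with 3 clusters:", spread(points, weights, labels3, centroids3))
print("spread with 2 clusters:", spread(points, weights, labels2, centroids2))
```

On this toy data, dropping from three clusters to two forces the string-handling and pointer-chasing functions into one cluster, so the within-cluster spread grows. This mirrors the abstract's point that a smaller silicon budget leaves each u-engine covering more varied compute structures.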