Two technological trends are apparent in current-day systems: the march toward many-core designs and a greater focus on power efficiency. Rising core counts shrink the cache available per compute node and increase reliance on exposing task-level parallelism in applications. This, in turn, can increase the amount of data that moves within and between tasks and, with it, the associated power cost, placing a new burden on already power-constrained systems. The situation will only worsen going forward, because the power consumed by wires is not scaling down much with each technology generation, while the amount of data those wires move grows every generation. This paper addresses this concern by identifying the memory access patterns that account for much of the data movement and by designing processor extensions, Apes, to support them. These extensions are placed close to the cache structures, rather than the core pipeline, to reduce data movement and improve compute-data co-location. We show that this reduces a task's memory accesses by ~2.5×, data movement by 4×, and cache miss rate by 40% across a wide range of applications.
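As a minimal sketch of the kind of access pattern at issue (an illustration, not code from the paper): an indirect gather-reduce touches many cache lines but produces only a scalar. On a conventional core every gathered element travels from the cache hierarchy into the pipeline; an extension sitting next to the cache could instead perform the accumulation in place and return only the result, cutting the data moved per task. All names below are hypothetical.

```python
# Hypothetical illustration of a data-movement-heavy access pattern:
# an indirect gather-reduce. Each values[idx] access pulls a (possibly
# distinct) cache line toward the core, yet the pipeline only needs the
# final scalar sum -- the case where near-cache compute would help.

def gather_reduce(values, indices):
    """Sum values at the given indices: O(len(indices)) loads, O(1) result."""
    total = 0
    for idx in indices:
        total += values[idx]  # irregular, data-dependent load
    return total

print(gather_reduce([10, 20, 30, 40], [3, 0, 3]))  # 40 + 10 + 40 = 90
```

The point of the sketch is the asymmetry: the traffic is proportional to the number of gathered elements, while the useful output is a single word, so co-locating the reduction with the cached data eliminates almost all of the movement.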