Two technological trends are apparent in current-day systems: the march toward many-core designs and a greater focus on power efficiency. Rising core counts shrink the cache available per compute node and increase reliance on exposing task-level parallelism in applications. This, in turn, can increase the amount of data that moves within and between tasks and, with it, the associated power cost, placing a new burden on already power-constrained systems. The situation will only worsen going forward, because the power consumed by wires is not scaling down much with each technology generation, while the amount of data those wires move grows every generation. This paper addresses this concern by identifying the memory access patterns that account for much of the data movement and by designing processor extensions, Apes, to support them. These extensions are placed close to the cache structures, rather than the core pipeline, to reduce data movement and improve compute-data co-location. We show that this reduces a task's memory accesses by ~2.5×, data movement by 4×, and cache miss rate by 40% across a wide range of applications.
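As a minimal sketch of the kind of access pattern at issue (an illustration, not code from the paper): an indirect gather-reduce touches many cache lines but produces only a scalar. On a conventional core every gathered element travels from the cache hierarchy into the pipeline; an extension sitting next to the cache could instead perform the accumulation in place and return only the result, cutting the data moved per task. All names below are hypothetical.

```python
# Hypothetical illustration of a data-movement-heavy access pattern:
# an indirect gather-reduce. Each values[idx] access pulls a (possibly
# distinct) cache line toward the core, yet the pipeline only needs the
# final scalar sum -- the case where near-cache compute would help.

def gather_reduce(values, indices):
    """Sum values at the given indices: O(len(indices)) loads, O(1) result."""
    total = 0
    for idx in indices:
        total += values[idx]  # irregular, data-dependent load
    return total

print(gather_reduce([10, 20, 30, 40], [3, 0, 3]))  # 40 + 10 + 40 = 90
```

The point of the sketch is the asymmetry: the traffic is proportional to the number of gathered elements, while the useful output is a single word, so co-locating the reduction with the cached data eliminates almost all of the movement.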