The tradeoffs of fused memory hierarchies in heterogeneous computing architectures

Authors:
Kyle L. Spafford; Jeremy S. Meredith; Seyong Lee; Dong Li; Philip C. Roth; Jeffrey S. Vetter
Affiliations:
Oak Ridge National Laboratory, Oak Ridge, TN, USA; Oak Ridge National Laboratory, Oak Ridge, TN, USA; Oak Ridge National Laboratory, Oak Ridge, TN, USA; Oak Ridge National Laboratory, Oak Ridge, TN, USA; Oak Ridge National Laboratory, Oak Ridge, TN, USA; Oak Ridge National Laboratory, Oak Ridge, TN, USA
Venue:
Proceedings of the 9th conference on Computing Frontiers
Year:
2012

Abstract

With the rise of general-purpose computing on graphics processing units (GPGPU), the influence of consumer markets can now be seen across the spectrum of computer architectures. In fact, many of the high-ranking Top500 HPC systems now include these accelerators. Traditionally, GPUs have connected to the CPU via the PCIe bus, which has proved to be a significant bottleneck for scalable scientific applications. Now, a trend toward tighter integration between CPU and GPU has removed this bottleneck and unified the memory hierarchy for both CPU and GPU cores. We examine the impact of this trend on high-performance scientific computing, using AMD's new Fusion Accelerated Processing Unit (APU) as a testbed. In particular, we evaluate the tradeoffs in performance, power consumption, and programmability when comparing this unified memory hierarchy with similar but discrete GPUs.