The wide use of GPGPU programming models and compiler techniques enables the optimization of data-parallel programs on commodity GPUs. However, mapping GPGPU applications written for discrete GPUs onto emerging integrated heterogeneous architectures, such as the AMD Fusion APU and Intel Sandy Bridge/Ivy Bridge, which place the CPU and the GPU on the same die, has not been well studied. Classic time-step simulations, represented by agent-based models, have an intrinsically parallel structure that fits GPGPU architectures well. When these applications are mapped directly to integrated GPUs, however, performance may degrade because the integrated GPU has fewer compute units and a lower clock speed. This paper proposes an optimization of the GPGPU implementation of an agent-based model and illustrates it with a traffic simulation example. The optimization adapts the algorithm by moving part of the workload to the CPU, leveraging the integrated architecture and its on-chip memory bus, which is faster than the PCIe bus that connects a discrete GPU to the host. Experiments on a discrete AMD Radeon GPU and an AMD Fusion APU demonstrate that the optimization achieves a 1.08x to 2.71x speedup on the integrated architecture over the discrete platform.
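The idea of splitting one simulation time step between the CPU and the GPU can be illustrated with a small sketch. The abstract does not say which part of the workload moves to the CPU, so everything below is an assumption: the split by agent index, the `cpu_fraction` parameter, the function names, and the simplified deterministic car-following rule on a ring road are hypothetical and are not the paper's model. Plain Python list comprehensions stand in for the device kernels.

```python
def next_speed(i, pos, speed, road_len, v_max):
    """Speed rule for car i, reading only the previous step's snapshot."""
    n = len(pos)
    ahead = pos[(i + 1) % n]                 # cars are stored in ring order
    gap = (ahead - pos[i] - 1) % road_len    # free cells in front of car i
    return min(speed[i] + 1, v_max, gap)     # accelerate, but never hit the car ahead


def step(pos, speed, road_len, v_max, cpu_fraction):
    """Advance every car by one time step, splitting the agents in two groups.

    Assumption: the first `cpu_fraction` of the agents stands for the share
    handled by the CPU and the rest for the share handled by the GPU.
    """
    n = len(pos)
    split = int(n * cpu_fraction)
    # Both groups read the same read-only snapshot, so the result does not
    # depend on where the split falls. On an integrated chip both devices
    # could read this snapshot from shared memory without a PCIe copy.
    cpu_part = [next_speed(i, pos, speed, road_len, v_max) for i in range(split)]
    gpu_part = [next_speed(i, pos, speed, road_len, v_max) for i in range(split, n)]
    new_speed = cpu_part + gpu_part
    new_pos = [(p + v) % road_len for p, v in zip(pos, new_speed)]
    return new_pos, new_speed


def simulate(pos, speed, road_len, v_max, steps, cpu_fraction):
    """Run `steps` time steps and return the final positions and speeds."""
    for _ in range(steps):
        pos, speed = step(pos, speed, road_len, v_max, cpu_fraction)
    return pos, speed


if __name__ == "__main__":
    final_pos, final_speed = simulate([0, 3, 7], [0, 0, 0], road_len=10,
                                      v_max=2, steps=2, cpu_fraction=0.5)
    print(final_pos, final_speed)
```

Because each group only reads the previous step's state and writes its own slice of the new state, changing `cpu_fraction` changes who does the work but not the result. In a real implementation the best split would have to be measured, since it depends on the relative throughput of the two devices.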