Accelerating GPGPU architecture simulation

  • Authors and affiliations:
  • Zhibin Yu, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
  • Lieven Eeckhout, ELIS Department, Ghent University, Ghent, Belgium
  • Nilanjan Goswami, Intelligent Design of Efficient Architectures Lab, University of Florida, Gainesville, USA
  • Tao Li, Intelligent Design of Efficient Architectures Lab, University of Florida, Gainesville, USA
  • Lizy John, Department of Electrical and Computer Engineering, University of Texas at Austin, Austin, USA
  • Hai Jin, Service Computing Technologies and System Lab/Cluster and Grid Computing Lab, HUST, Wuhan, China
  • Chengzhong Xu, Department of Electrical and Computer Engineering, Wayne State University, Detroit, USA

  • Venue:
  • Proceedings of the ACM SIGMETRICS/international conference on Measurement and modeling of computer systems
  • Year:
  • 2013

Abstract

Recently, graphics processing units (GPUs) have opened up new opportunities for speeding up general-purpose parallel applications, thanks to their massive computational power and the hundreds of thousands of concurrent threads enabled by programming models such as CUDA. However, because existing micro-architecture simulators are serial, these massively parallel architectures and workloads must be simulated sequentially. As a result, simulating GPGPU architectures with typical benchmarks and input data sets is extremely time-consuming. This paper addresses the GPGPU architecture simulation challenge by generating miniature, yet representative GPGPU kernels. We first summarize the static characteristics of an existing GPGPU kernel in a profile, and analyze its dynamic behavior using the novel concept of the divergence flow statistics graph (DFSG). We subsequently use a GPGPU kernel synthesizing framework to generate a miniature proxy of the original kernel, which can reduce simulation time significantly. The key idea is to reduce the number of simulated instructions by decreasing per-thread iteration counts of loops. Our experimental results show that our approach can accelerate GPGPU architecture simulation by a factor of 88X on average and up to 589X, with an average IPC relative error of 5.6%.