Accelerating statistical static timing analysis using graphics processing units

  • Authors:
  • Kanupriya Gulati;Sunil P. Khatri

  • Affiliations:
  • Texas A&M University, College Station, TX;Texas A&M University, College Station, TX

  • Venue:
  • Proceedings of the 2009 Asia and South Pacific Design Automation Conference
  • Year:
  • 2009

Quantified Score

Hi-index 0.00

Visualization

Abstract

In this paper, we explore the implementation of Monte Carlo based statistical static timing analysis (SSTA) on a Graphics Processing Unit (GPU). SSTA via Monte Carlo simulations is a computationally expensive, but important step required to achieve design timing closure. It provides an accurate estimate of delay variations and their impact on design yield. The large number of threads that can be computed in parallel on a GPU suggests a natural fit for the problem of Monte Carlo based SSTA to the GPU platform. Our implementation performs multiple delay simulations at a single gate in parallel. A parallel implementation of the Mersenne Twister pseudo-random number generator on the GPU, followed by Box-Muller transformations (also implemented on the GPU) is used for generating gate delay numbers from a normal distribution. The μ and σ of the pin-to-output delay distributions for all inputs and for every gate, are obtained using a memory lookup, which benefits from the large memory bandwidth of the GPU. Threads which execute in parallel have no data/control dependencies on each other. All threads compute identical instructions, but on different data, as required by the Single Instruction Multiple Data (SIMD) programming semantics of the GPU. Our approach is implemented on a NVIDIA GeForce GTX 8800 GPU card. Our results indicate that our approach can obtain an average speedup of about 260x as compared to a serial CPU implementation. With the recently announced quad 8800 GPU cards, we estimate that our approach would attain a speedup of over 785x. The correctness of the Monte Carlo based SSTA implemented on a GPU has been verified by comparing its results with a CPU based implementation.