Exploiting Coarse-Grained Parallelism in B+ Tree Searches on an APU

  • Authors:
  • Mayank Daga; Mark Nutter


  • Venue:
  • SCC '12: Proceedings of the 2012 SC Companion: High Performance Computing, Networking, Storage and Analysis
  • Year:
  • 2012


Abstract

B+ tree structured index searches are among the fundamental database operations, and accelerating them is therefore essential. GPUs provide a compelling mix of performance per watt and performance per dollar and are thus an attractive platform for accelerating B+ tree searches. However, tree search on discrete GPUs presents significant challenges for acceleration due to (i) the tree's irregular representation in memory and (ii) the need to copy the tree to GPU memory over the PCIe bus. In this paper, we present the acceleration of B+ tree searches on a fused CPU+GPU processor (an accelerated processing unit, or APU). We counter the aforementioned issues by reorganizing the B+ tree in memory and leveraging the novel Heterogeneous System Architecture, which eliminates (i) the need to copy the tree to the GPU and (ii) the limit on the size of the tree that can be accelerated. Our approach exploits the coarse-grained parallelism in tree search: we execute multiple searches in parallel to optimize for the SIMD width, without modifying the inherent B+ tree data structure. Our results illustrate that the APU implementation can perform up to 70M queries per second and is 4.9x faster in the best case, and 2.5x faster on average, than a hand-tuned, SSE-optimized, six-core CPU implementation, for varying orders of a B+ tree with 4M keys. We also present an analysis of the effect of caches on performance, and of the efficacy of the APU in eliminating data copies.
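To make the abstract's scheme concrete, the following is a minimal Python sketch of the two ideas it names: the tree is bulk-loaded into one contiguous node array (standing in for the paper's pointer-free memory reorganization, with array indices instead of pointers), and a batch of queries is answered by running one independent root-to-leaf search per query (coarse-grained parallelism, analogous to one GPU work-item per query). All names here (`ORDER`, `build_tree`, `batch_search`) are illustrative assumptions, not identifiers from the paper, and a thread pool merely stands in for the APU's SIMD lanes.

```python
from bisect import bisect_right
from concurrent.futures import ThreadPoolExecutor

ORDER = 4  # max children per node (the paper varies this "order")

def build_tree(sorted_keys):
    """Bulk-load a B+ tree from sorted keys into one flat node array.

    Nodes reference children by array index, not pointer, mimicking a
    contiguous, copy-friendly memory layout. Returns (nodes, root_index).
    """
    nodes, level, mins = [], [], []
    # Leaf level: chunks of ORDER keys each.
    for i in range(0, len(sorted_keys), ORDER):
        chunk = sorted_keys[i:i + ORDER]
        nodes.append({"leaf": True, "keys": chunk, "children": []})
        level.append(len(nodes) - 1)
        mins.append(chunk[0])  # minimum key in this subtree
    # Internal levels, built bottom-up until a single root remains.
    while len(level) > 1:
        next_level, next_mins = [], []
        for i in range(0, len(level), ORDER):
            group = level[i:i + ORDER]
            # Separators: subtree minima of every child except the first.
            seps = mins[i + 1:i + len(group)]
            nodes.append({"leaf": False, "keys": seps, "children": group})
            next_level.append(len(nodes) - 1)
            next_mins.append(mins[i])
        level, mins = next_level, next_mins
    return nodes, level[0]

def search(nodes, root, key):
    """One search: walk from root to leaf via array indices."""
    idx = root
    while not nodes[idx]["leaf"]:
        node = nodes[idx]
        idx = node["children"][bisect_right(node["keys"], key)]
    return key in nodes[idx]["keys"]

def batch_search(nodes, root, queries):
    """Coarse-grained parallelism: each query is an independent search,
    so a batch maps directly onto parallel workers."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda q: search(nodes, root, q), queries))
```

Because each search is read-only and independent, the batch needs no synchronization; on the APU the same structure lets the GPU traverse the tree in place, with no copy over PCIe.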