Kernel Weaver: Automatically Fusing Database Primitives for Efficient GPU Computation

Authors:
Haicheng Wu;Gregory Diamos;Srihari Cadambi;Sudhakar Yalamanchili
Venue:
MICRO-45 Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture
Year:
2012

Abstract

Data warehousing applications represent an emerging application arena that requires the processing of relational queries and computations over massive amounts of data. Modern general-purpose GPUs are high-bandwidth architectures that potentially offer substantial improvements in throughput for these applications. However, significant challenges arise from the overheads of data movement through the memory hierarchy and between the GPU and host CPU. This paper proposes data movement optimizations to address these challenges. Inspired in part by loop fusion optimizations in the scientific computing community, we propose kernel fusion as a basis for data movement optimizations. Kernel fusion fuses the code bodies of two GPU kernels to i) reduce the data footprint and thereby cut down data movement throughout the GPU and CPU memory hierarchy, and ii) enlarge the compiler optimization scope. We classify producer-consumer dependences between compute kernels into three types: i) fine-grained thread-to-thread dependences, ii) medium-grained thread block dependences, and iii) coarse-grained kernel dependences. Based on this classification, we propose a compiler framework, Kernel Weaver, that can automatically fuse relational algebra operators, thereby eliminating redundant data movement. Experiments on NVIDIA Fermi platforms demonstrate that kernel fusion achieves a 2.89x speedup in GPU computation and a 2.35x speedup in PCIe transfer time on average across the micro-benchmarks tested. We present key insights, lessons learned, measurements from our compiler implementation, and opportunities for further improvements.
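The data-footprint argument in the abstract can be illustrated with a small CPU-side sketch. This is not Kernel Weaver's implementation and not GPU code: it only contrasts two back-to-back SELECT operators, where the first materializes an intermediate relation, with a single fused pass that applies both predicates per tuple. The `(key, value)` row layout, the function names, and the predicates are all hypothetical, chosen for illustration.

```python
from typing import Callable, List, Tuple

Row = Tuple[int, int]  # hypothetical (key, value) tuple layout
Pred = Callable[[Row], bool]


def select_unfused(rows: List[Row], pred_a: Pred, pred_b: Pred):
    """Two separate SELECT passes; the first materializes an intermediate."""
    intermediate = [r for r in rows if pred_a(r)]     # pass 1: write intermediate
    result = [r for r in intermediate if pred_b(r)]   # pass 2: re-read it
    return result, len(intermediate)


def select_fused(rows: List[Row], pred_a: Pred, pred_b: Pred):
    """One pass applying both predicates; no intermediate relation is stored."""
    result = [r for r in rows if pred_a(r) and pred_b(r)]
    return result, 0


# Illustrative input: keys 0..7, value = key * 10 (made-up data).
rows = [(k, k * 10) for k in range(8)]
is_even_key = lambda r: r[0] % 2 == 0
value_over_20 = lambda r: r[1] > 20

unfused, unfused_tmp = select_unfused(rows, is_even_key, value_over_20)
fused, fused_tmp = select_fused(rows, is_even_key, value_over_20)

print(unfused == fused)        # both variants return the same tuples
print(unfused_tmp, fused_tmp)  # intermediate tuples stored by each variant
```

Both variants produce the same result, but the unfused version stores and re-reads an intermediate relation. On a GPU, that intermediate would typically travel through global memory (or across PCIe if it does not fit on the device), which is the redundant data movement that fusion aims to remove.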