Statistical profile estimation in database systems
ACM Computing Surveys (CSUR)
A bridging model for parallel computation
Communications of the ACM
Query Optimization in Database Systems
ACM Computing Surveys (CSUR)
Foundations of Databases: The Logical Level
Foundations of Databases: The Logical Level
Maximizing Loop Parallelism and Improving Data Locality via Loop Fusion and Distribution
Proceedings of the 6th International Workshop on Languages and Compilers for Parallel Computing
LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation
Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization
Fast computation of database operations using graphics processors
SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
GPUTeraSort: high performance graphics co-processor sorting for large database management
Proceedings of the 2006 ACM SIGMOD international conference on Management of data
General purpose molecular dynamics simulations fully implemented on graphics processing units
Journal of Computational Physics
Cache-oblivious databases: Limitations and opportunities
ACM Transactions on Database Systems (TODS)
Harmony: an execution model and runtime for heterogeneous many core systems
HPDC '08 Proceedings of the 17th international symposium on High performance distributed computing
Data parallel acceleration of decision support queries using Cell/BE and GPUs
Proceedings of the 6th ACM conference on Computing frontiers
A Fast Similarity Join Algorithm Using Graphics Processing Units
ICDE '08 Proceedings of the 2008 IEEE 24th International Conference on Data Engineering
Frequent itemset mining on graphics processors
Proceedings of the Fifth International Workshop on Data Management on New Hardware
Efficient stream compaction on wide SIMD many-core architectures
Proceedings of the Conference on High Performance Graphics 2009
Relational query coprocessing on graphics processors
ACM Transactions on Database Systems (TODS)
A Skeletal Parallel Framework with Fusion Optimizer for GPGPU Programming
APLAS '09 Proceedings of the 7th Asian Symposium on Programming Languages and Systems
Accelerating SQL database operations on a GPU with CUDA
Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units
OptiX: a general purpose ray tracing engine
ACM SIGGRAPH 2010 papers
Thread tailor: dynamically weaving threads together for efficient, adaptive parallel applications
Proceedings of the 37th annual international symposium on Computer architecture
Ocelot: a dynamic optimization framework for bulk-synchronous applications in heterogeneous systems
Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Exploring graphics processing units as parallel coprocessors for online aggregation
DOLAP '10 Proceedings of the ACM 13th international workshop on Data warehousing and OLAP
Database compression on graphics processors
Proceedings of the VLDB Endowment
Accelerating Haskell array codes with multicore GPUs
Proceedings of the sixth workshop on Declarative aspects of multicore programming
Copperhead: compiling an embedded data parallel language
Proceedings of the 16th ACM symposium on Principles and practice of parallel programming
Bioinformatics
Datalog and emerging applications: an interactive tutorial
Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Fast intersection algorithms for sorted sequences
Algorithms and Applications
Characterizing and evaluating a key-value store application on heterogeneous CPU-GPU systems
ISPASS '12 Proceedings of the 2012 IEEE International Symposium on Performance Analysis of Systems & Software
Optimizing Data Warehousing Applications for GPUs Using Kernel Fusion/Fission
IPDPSW '12 Proceedings of the 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum
EGVE'05 Proceedings of the 11th Eurographics conference on Virtual Environments
Relational algorithms for multi-bulk-synchronous processors
Proceedings of the 18th ACM SIGPLAN symposium on Principles and practice of parallel programming
Accelerating simulation of agent-based models on heterogeneous architectures
Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units
Optimizing select conditions on GPUs
Proceedings of the Ninth International Workshop on Data Management on New Hardware
LINQits: big data on little clients
Proceedings of the 40th Annual International Symposium on Computer Architecture
Semi-automatic restructuring of offloadable tasks for many-core accelerators
SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles
ACM SIGOPS 24th Symposium on Operating Systems Principles
Dandelion: a compiler and runtime for heterogeneous systems
Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles
The Yin and Yang of processing data warehousing queries on GPU devices
Proceedings of the VLDB Endowment
Rhythm: harnessing data parallel hardware for server workloads
Proceedings of the 19th international conference on Architectural support for programming languages and operating systems
Red Fox: An Execution Environment for Relational Query Processing on GPUs
Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization
Hi-index | 0.00 |
Data warehousing applications represent an emerging application arena that requires the processing of relational queries and computations over massive amounts of data. Modern general purpose GPUs are high bandwidth architectures that potentially offer substantial improvements in throughput for these applications. However, there are significant challenges that arise due to the overheads of data movement through the memory hierarchy and between the GPU and host CPU. This paper proposes data movement optimizations to address these challenges. Inspired in part by loop fusion optimizations in the scientific computing community, we propose kernel fusion as a basis for data movement optimizations. Kernel fusion fuses the code bodies of two GPU kernels to i) reduce data footprint to cut down data movement throughout GPU and CPU memory hierarchy, and ii) enlarge compiler optimization scope. We classify producer consumer dependences between compute kernels into three types, i) fine-grained thread-to-thread dependences, ii) medium-grained thread block dependences, and iii) coarse-grained kernel dependences. Based on this classification, we propose a compiler framework, Kernel Weaver, that can automatically fuse relational algebra operators thereby eliminating redundant data movement. The experiments on NVIDIA Fermi platforms demonstrate that kernel fusion achieves 2.89x speedup in GPU computation and a 2.35x speedup in PCIe transfer time on average across the micro-benchmarks tested. We present key insights, lessons learned, measurements from our compiler implementation, and opportunities for further improvements.