Efficiently evaluating skyline queries on RDF databases

  • Authors:
  • Ling Chen; Sidan Gao; Kemafor Anyanwu

  • Affiliations:
  • Semantic Computing Research Lab, Department of Computer Science, North Carolina State University (all authors)

  • Venue:
  • ESWC'11: Proceedings of the 8th Extended Semantic Web Conference on The Semantic Web: Research and Applications - Volume Part II
  • Year:
  • 2011

Abstract

Skyline queries are a class of preference queries that compute the Pareto-optimal tuples from a set of tuples and are valuable in multi-criteria decision-making scenarios. While this problem has received significant attention in the context of a single relational table, skyline queries over joins of multiple tables, which are typical of storage models for RDF data, have received much less attention. A naïve approach such as a join-first-skyline-later strategy splits the join and skyline computation phases, which limits opportunities for optimization. Other existing techniques for multi-relational skyline queries assume storage and indexing techniques that are not typically used with RDF and would therefore require a preprocessing step for data transformation. In this paper, we present an approach for optimizing skyline queries over RDF data stored using a vertically partitioned schema model. It is based on the concept of a "Header Point", which maintains a concise summary of the already visited regions of the data space. This summary allows some fraction of non-skyline tuples to be pruned before they advance to the skyline processing phase, thus reducing the overall cost of the expensive dominance checks required in that phase. We further present more aggressive pruning rules that compute near-complete skylines in significantly less time than the complete algorithm. A comprehensive performance evaluation of the different algorithms is presented using datasets with different data distributions generated by a benchmark data generator.
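
The sketch below illustrates, in Python, the two notions the abstract builds on: Pareto dominance checks for skyline computation and a summary-point pre-filter that discards dominated tuples before the skyline phase. It is a minimal illustration under the assumption that smaller values are preferred in every dimension; the `dominates`, `skyline`, `prefilter`, and `header_point` names are hypothetical and do not reproduce the paper's actual algorithm or data structures.

```python
# Minimal sketch: skyline (Pareto-optimal) computation plus a summary-point
# pre-filter loosely inspired by the "Header Point" idea in the abstract.
# Assumption: smaller values are preferred on every dimension.
from typing import List, Tuple

def dominates(a: Tuple[float, ...], b: Tuple[float, ...]) -> bool:
    """a dominates b if a is no worse in every dimension and strictly better in one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def skyline(tuples: List[Tuple[float, ...]]) -> List[Tuple[float, ...]]:
    """Block-nested-loop style skyline: keep tuples not dominated by any other."""
    result: List[Tuple[float, ...]] = []
    for t in tuples:
        if any(dominates(s, t) for s in result):
            continue                                          # t is dominated, skip it
        result = [s for s in result if not dominates(t, s)]   # drop tuples t dominates
        result.append(t)
    return result

def prefilter(tuples, header_point):
    """Hypothetical pruning step: remove tuples dominated by a summary point
    before paying for the quadratic dominance checks of the skyline phase."""
    return [t for t in tuples if not dominates(header_point, t)]

if __name__ == "__main__":
    data = [(1, 9), (3, 3), (5, 5), (4, 2), (9, 1), (6, 6)]
    header_point = (5, 5)                  # assumed summary of visited regions
    candidates = prefilter(data, header_point)   # (6, 6) is pruned here
    print(skyline(candidates))             # -> [(1, 9), (3, 3), (4, 2), (9, 1)]
```

Pruning only tuples that are dominated by a point derived from already visited data is safe, because such tuples can never appear in the final skyline; this is why a pre-filter of this kind reduces the number of dominance checks without changing the (complete) result.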