Indexing dataspaces with partitions

  • Authors:
  • Shaoxu Song;Lei Chen

  • Affiliations:
  • Key Laboratory for Information System Security, Ministry of Education/ TNList/ School of Software, Tsinghua University, Beijing, China;Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong

  • Venue:
  • World Wide Web
  • Year:
  • 2013

Quantified Score

Hi-index 0.00

Visualization

Abstract

Dataspaces are recently proposed to manage heterogeneous data, with features like partially unstructured, high dimension and extremely sparse. The inverted index has been previously extended to retrieve dataspaces. In order to achieve more efficient access to dataspaces, in this paper, we first introduce our survey of data features in the real dataspaces. Based on the features observed in our study, several partitioning based index approaches are proposed to accelerate the query processing in dataspaces. Specifically, the vertical partitioning index utilizes the partitions on tokens to merge and compress data. We can both reduce the number of I/O reads and avoid aggregation of data inside a compressed list. The horizontal partitioning index supports pruning partitions of tuples in the top-k query. Thus, we can reduce the computation overhead of irrelevant candidate tuples to the query. Finally, we also propose a hybrid index with both vertical and horizontal partitioning. The extensive experiment results in real data sets demonstrate that our approaches outperform the previous techniques and scale well with the large data size.