Subspace Similarity Search under L_p-Norm

  • Authors:
  • Xiang Lian; Lei Chen

  • Affiliations:
  • University of Texas - Pan American, Edinburg; Hong Kong University of Science and Technology, Hong Kong

  • Venue:
  • IEEE Transactions on Knowledge and Data Engineering
  • Year:
  • 2012

Abstract

Similarity search has been widely used in many applications such as information retrieval, image data analysis, and time-series matching. Previous work on similarity search usually considers the search problem in the full space. In this paper, however, we tackle the problem of subspace similarity search, which finds all data objects that match a query object in a subspace rather than in the original full space. In particular, the query object can specify an arbitrary subspace with an arbitrary number of dimensions. Because the number of possible user-specified subspaces is exponential, we introduce an efficient and effective pruning technique that assigns scores to data objects with respect to pivots and prunes candidates via these scores. We propose an effective multipivot-based method to preprocess data objects by selecting appropriate pivots, where the entire procedure is guided by a formal cost model so that the pruning power is maximized. The scores of each data object are then organized in sorted lists to facilitate efficient subspace similarity search. Furthermore, many real-world data sets, such as image databases, time-series data, and sensory data, often contain noise, which can be modeled with uncertain objects. Unlike query processing over certain data, efficient query processing over uncertain data is more challenging due to the intensive computation of probabilistic confidences. Thus, it is also crucial to answer subspace queries efficiently and effectively over uncertain objects. Specifically, we define a novel query, the probabilistic subspace range query (PSRQ), over an uncertain database, which finds objects that lie within a given distance of a query object in any specified subspace with high probability. To answer PSRQs in arbitrary subspaces, we extend the pruning techniques proposed for precise data. Extensive experiments demonstrate the performance of our proposed approaches.
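
To make the pivot-based pruning idea concrete, the sketch below implements a simplified subspace range query under the L_p-norm: per-dimension deviations of every object from a small set of pivots are precomputed, and at query time the triangle inequality, applied within the query subspace, filters objects before any exact distance is evaluated. This is only an illustrative sketch; the names (PivotIndex, subspace_range_query, lp_subspace_dist) are assumptions, and it does not reproduce the paper's cost-model-driven pivot selection, its sorted-list score organization, or the probabilistic extension used for PSRQ.

    import numpy as np

    def lp_subspace_dist(x, y, dims, p=2.0):
        # L_p distance between x and y restricted to the dimensions in `dims`
        diff = np.abs(np.asarray(x, dtype=float)[dims] - np.asarray(y, dtype=float)[dims])
        return np.sum(diff ** p) ** (1.0 / p)

    class PivotIndex:
        # Illustrative pivot-based index (not the paper's exact scheme):
        # stores |o_j - pivot_j|^p for every object o, pivot, and dimension j,
        # so pivot-to-object distances in any subspace can be assembled at query time.
        def __init__(self, data, pivots, p=2.0):
            self.data = np.asarray(data, dtype=float)      # (n, d) data objects
            self.pivots = np.asarray(pivots, dtype=float)  # (m, d) pivots
            self.p = p
            # dev[k, i, j] = |data[i, j] - pivots[k, j]|^p
            self.dev = np.abs(self.data[None, :, :] - self.pivots[:, None, :]) ** p

        def subspace_range_query(self, q, dims, radius):
            # Return indices of objects whose L_p distance to q on `dims` is <= radius.
            q = np.asarray(q, dtype=float)
            candidates = np.ones(self.data.shape[0], dtype=bool)
            for k, pivot in enumerate(self.pivots):
                d_qp = lp_subspace_dist(q, pivot, dims, self.p)
                # pivot-to-object distances in the query subspace, from precomputed deviations
                d_po = np.sum(self.dev[k][:, dims], axis=1) ** (1.0 / self.p)
                # triangle inequality in the subspace: |d(q,p) - d(p,o)| <= d(q,o),
                # so any object whose lower bound exceeds the radius can be pruned safely
                candidates &= np.abs(d_qp - d_po) <= radius
            # refine the surviving candidates with exact subspace distances
            return [i for i in np.nonzero(candidates)[0]
                    if lp_subspace_dist(q, self.data[i], dims, self.p) <= radius]

For example, with data = rng.random((1000, 8)) and four pivots drawn from the data, subspace_range_query(q, dims=[1, 3, 6], radius=0.3) returns exactly the objects whose L2 distance to q over dimensions {1, 3, 6} is at most 0.3; the choice of pivots affects only how many candidates survive the filtering step, never the correctness of the answer.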