Principles and applications for supporting similarity queries in non-ordered-discrete and continuous data spaces

  • Authors:
  • Sakti Pramanik;Gang Qian

  • Venue:
  • Principles and applications for supporting similarity queries in non-ordered-discrete and continuous data spaces
  • Year:
  • 2004

Abstract

The problem of similarity queries has received much attention in recent years because of its wide range of applications in new and emerging areas. The objective of this thesis is to develop and analyze novel algorithms that support similarity queries using the vector model. We first discuss supporting similarity queries in multidimensional Non-ordered Discrete Data Spaces (NDDS), which are important for application areas such as Data Mining and Bioinformatics. Existing indexing methods developed for Continuous Data Spaces (CDS) cannot be directly applied to an NDDS because an NDDS lacks some essential geometric concepts and properties. To solve this problem, we have established discrete geometrical concepts that have counterparts in a CDS. Based on these concepts, we have developed two novel indexing structures, called the ND-tree and the NSP-tree. The ND-tree is the first index structure of its kind; its construction algorithms are designed around the special properties of the NDDS and use a data-partitioning approach. The NSP-tree is also based on the special properties of the NDDS, but it uses space-partitioning techniques together with new strategies, such as partitioning the actual data space instead of the whole space and using multiple minimum bounding rectangles per node. Our extensive studies show that the performance of the ND-tree and the NSP-tree is significantly better than that of the existing methods. The NSP-tree is shown to be particularly efficient for large skewed datasets. We have also proposed the NDh-tree to support similarity queries in Hybrid Data Spaces (HDS), which contain both continuous and non-ordered discrete dimensions. As an extension of the ND-tree, the NDh-tree is developed based on geometrical concepts defined for an HDS and is capable of handling continuous dimensions efficiently. Our experimental results show that the NDh-tree is a promising indexing structure for HDSs.
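To make the notion of a similarity query in a non-ordered discrete space concrete, the following is a minimal sketch of a range query answered by linear scan, which is the baseline that an index structure is meant to improve on. It is not the ND-tree or NSP-tree algorithm. The abstract does not name a distance measure for the NDDS, so the use of the Hamming distance here is an assumption, and the names `hamming_distance`, `range_query`, and `genome_kmers` are hypothetical.

```python
from typing import List, Sequence, Tuple

# A vector in an NDDS: each dimension takes a value from an alphabet
# with no natural ordering (here, the nucleotides A, C, G, T).
Vector = Tuple[str, ...]


def hamming_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Count the dimensions on which two discrete vectors differ.

    Assumption: Hamming distance is used as the NDDS distance measure.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    return sum(1 for x, y in zip(a, b) if x != y)


def range_query(data: List[Vector], query: Vector, radius: int) -> List[Vector]:
    """Return every vector within `radius` of `query` by scanning all data."""
    return [v for v in data if hamming_distance(v, query) <= radius]


# Hypothetical toy dataset of 4-dimensional vectors.
genome_kmers: List[Vector] = [
    tuple("ACGT"),
    tuple("ACGA"),
    tuple("TTGA"),
    tuple("CCCC"),
]

hits = range_query(genome_kmers, tuple("ACGG"), radius=1)
print(hits)
```

The scan compares the query against every stored vector, so its cost grows linearly with the dataset size; an index over the same space aims to prune most of those comparisons.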
The thesis also addresses the problem of choosing a suitable distance measure for similarity queries using the vector model. Standard criteria for selecting an appropriate distance measure have yet to be established. In this thesis, we provide a basis for comparing distance measures for similarity queries by introducing a theoretical model that analyzes the relationship between two commonly used distance measures, the Euclidean distance and the cosine angle distance, in multidimensional data spaces. A similar methodology can be used to analyze other distance measures, such as the Manhattan distance. We believe that this work provides a fundamental basis for understanding and comparing distance measures for similarity queries.
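As a small illustration of why the two measures need to be compared, the sketch below computes both for a few vectors. It is not the theoretical model proposed in the thesis. It relies only on the standard identity that, for unit-length vectors separated by angle θ, the Euclidean distance equals 2·sin(θ/2), and it shows that without normalization the two measures can rank the same neighbors differently. The function names and sample vectors are hypothetical.

```python
import math
from typing import Sequence


def euclidean(x: Sequence[float], y: Sequence[float]) -> float:
    """Euclidean (L2) distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def cosine_angle(x: Sequence[float], y: Sequence[float]) -> float:
    """Angle in radians between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos_theta = max(-1.0, min(1.0, dot / (norm_x * norm_y)))
    return math.acos(cos_theta)


def normalize(x: Sequence[float]) -> list:
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(a * a for a in x))
    return [a / norm for a in x]


# On unit vectors the two measures are tied together: d = 2 * sin(theta / 2).
u = normalize([3.0, 4.0])
v = normalize([1.0, 2.0])
theta = cosine_angle(u, v)
print(euclidean(u, v), 2 * math.sin(theta / 2))

# Without normalization the rankings can disagree.
q = [1.0, 0.0]
a = [10.0, 0.0]  # same direction as q, but far away
b = [1.0, 1.0]   # close to q, but 45 degrees off
print(euclidean(q, a), euclidean(q, b))
print(cosine_angle(q, a), cosine_angle(q, b))
```

In the second part, `b` is the nearer neighbor of `q` under the Euclidean distance while `a` is the nearer neighbor under the cosine angle distance, which is the kind of disagreement a comparison framework has to account for.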