Efficient top-K approximate searches against a relation with multiple attributes

  • Authors:
  • Wei Lu;Jinchuan Chen;Xiaoyong Du;Jieping Wang;Wei Pan

  • Affiliations:
  • School of Information, Renmin University of China, Beijing, China 100872 and Key Labs of Data Engineering and Knowledge Engineering, Ministry of Education, Beijing, China;Key Labs of Data Engineering and Knowledge Engineering, Ministry of Education, Beijing, China;School of Information, Renmin University of China, Beijing, China 100872 and Key Labs of Data Engineering and Knowledge Engineering, Ministry of Education, Beijing, China;China Electronics Standardization Institute, Beijing, China;School of Computer Science and Engineering, Northwestern Polytechnical University, Xi'an, China

  • Venue:
  • World Wide Web
  • Year:
  • 2011

Abstract

In this paper, we study the problem of efficiently identifying the K records that are most similar to a given query record. Similarity is defined in two steps: (1) for each record, we calculate a similarity score between the record and the query record over each individual attribute, using a specific similarity function; (2) an aggregate function combines these similarity scores with weights, and the aggregated value serves as the similarity of the record. Once the similarities of all records have been computed, the K records with the greatest similarities can be identified. Under this framework, unfortunately, the computational cost becomes extremely high when the cardinality of the relation is large, because a similarity must be computed for every record. We therefore propose two efficient algorithms, named ScanIndex and Top-Down (TD for short), to cope with this problem. ScanIndex avoids computing similarity scores that are equal to zero over individual attributes. TD builds on ScanIndex and additionally skips similarity scores that are less than thresholds (rather than only those equal to zero) over individual attributes, where these thresholds are improved dynamically over time. Experimental results demonstrate that, compared with the naive approach, ScanIndex and TD improve performance by two orders of magnitude.
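To make the scoring framework concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes token-set Jaccard as the per-attribute similarity function and a weighted sum as the aggregate function; the abstract fixes neither. The names `jaccard`, `naive_top_k`, `build_index`, and `indexed_top_k` are hypothetical. The second routine only illustrates the general idea of not computing zero scores, here by using one inverted index per attribute; the actual ScanIndex data structures may differ, and the threshold-based pruning of TD is not shown.

```python
from collections import defaultdict


def jaccard(a, b):
    """Token-set Jaccard similarity (an assumed per-attribute similarity)."""
    sa, sb = set(a.split()), set(b.split())
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def naive_top_k(records, query, weights, k):
    """Naive approach: score every record on every attribute, keep the K best."""
    scored = []
    for rid, rec in enumerate(records):
        score = sum(w * jaccard(v, q) for w, v, q in zip(weights, rec, query))
        scored.append((score, rid))
    # Highest score first; ties broken by the smaller record id.
    scored.sort(key=lambda t: (-t[0], t[1]))
    return scored[:k]


def build_index(records, num_attrs):
    """One inverted index per attribute: token -> set of record ids."""
    index = [defaultdict(set) for _ in range(num_attrs)]
    for rid, rec in enumerate(records):
        for attr, value in enumerate(rec):
            for tok in set(value.split()):
                index[attr][tok].add(rid)
    return index


def indexed_top_k(records, index, query, weights, k):
    """Score only (record, attribute) pairs whose Jaccard score is non-zero.

    A pair has a non-zero Jaccard score exactly when the record shares at
    least one token with the query on that attribute, so the inverted index
    yields precisely those pairs. Records with an overall score of zero are
    never touched and are therefore not returned.
    """
    scores = defaultdict(float)
    for attr, (w, q) in enumerate(zip(weights, query)):
        candidates = set()
        for tok in set(q.split()):
            candidates |= index[attr].get(tok, set())
        for rid in candidates:
            scores[rid] += w * jaccard(records[rid][attr], q)
    ranked = sorted(((s, rid) for rid, s in scores.items()),
                    key=lambda t: (-t[0], t[1]))
    return ranked[:k]


# Illustrative data (made up for this sketch): name, city, occupation.
records = [
    ("john smith", "new york", "data engineer"),
    ("jon smith", "new jersey", "data analyst"),
    ("mary jones", "los angeles", "nurse"),
    ("john smyth", "york", "engineer"),
]
query = ("john smith", "new york", "engineer")
weights = (0.5, 0.3, 0.2)

index = build_index(records, num_attrs=3)
top_naive = naive_top_k(records, query, weights, k=2)
top_indexed = indexed_top_k(records, index, query, weights, k=2)
```

In this sketch both routines rank the same records first, but the indexed version never evaluates the third record, which shares no token with the query on any attribute. With a large relation and selective attributes, most (record, attribute) pairs fall into that category, which is the intuition behind skipping zero scores.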