The top-k retrieval problem requires finding the k objects most similar to a given query object. Similarities between objects are most often computed by aggregating the similarities of their attribute values. We consider the case where the similarities between attribute values are arbitrary (non-metric), so standard space-partitioning indexes cannot be used. Among the most popular techniques that can handle arbitrary similarity measures is the family of threshold algorithms. These were designed as middleware algorithms: they assume that a similarity list is available for each attribute and focus on merging these lists efficiently to arrive at the results. In this paper, we explore multi-dimensional indexing of non-metric spaces, which can exploit inter-attribute relationships to prune the search space efficiently during top-k computation. We propose an indexing structure, the AL-Tree, and an algorithm that uses it to perform top-k retrieval in an online fashion. The AL-Tree exploits the fact that many real-world attributes come from a small value space. We show that our algorithm has a much lower computational cost than the threshold-based algorithms, owing to efficient pruning of the search space. Further, it outperforms them in terms of I/Os by up to an order of magnitude on dense datasets.
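For readers unfamiliar with the baseline, the list-merging idea behind the threshold family can be sketched as follows. This is a minimal illustration of a classic threshold-style merge, not the AL-Tree or the algorithm proposed in the paper. The function name `threshold_topk`, the dictionary-based lists, and the use of `sum` as the aggregation function are assumptions made for the example; it also assumes every object appears in every attribute list and that the aggregate is monotone.

```python
import heapq


def threshold_topk(lists, k, aggregate=sum):
    """Return the k (object, score) pairs with the highest aggregated similarity.

    `lists` holds one dict per attribute, mapping object id -> similarity to
    the query. Assumption: every object occurs in every dict, and `aggregate`
    is monotone (a higher attribute similarity never lowers the total).
    """
    # Sorted access: each attribute list in descending order of similarity.
    sorted_lists = [
        sorted(attr.items(), key=lambda kv: kv[1], reverse=True) for attr in lists
    ]
    seen = {}  # object id -> fully aggregated score

    for depth in range(len(sorted_lists[0])):
        last_seen = []
        for ranked in sorted_lists:
            obj, sim = ranked[depth]
            last_seen.append(sim)
            if obj not in seen:
                # Random access: look up the object's similarity in every list.
                seen[obj] = aggregate(attr[obj] for attr in lists)

        # No unseen object can score higher than the aggregate of the
        # similarities at the current depth of each list.
        threshold = aggregate(last_seen)
        best = heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])
        if len(best) == k and best[-1][1] >= threshold:
            return best  # early termination: the rest of the lists is pruned

    return heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])


# Hypothetical similarity lists for two attributes and three objects.
colour = {"a": 0.9, "b": 0.8, "c": 0.1}
shape = {"a": 0.2, "b": 0.7, "c": 0.9}
top1 = threshold_topk([colour, shape], k=1)
top2 = threshold_topk([colour, shape], k=2)
```

In this toy input the merge stops after reading two entries from each list when k is 1, because the best seen score already matches the threshold. Note that the sketch treats each attribute list independently, which is the property the abstract contrasts with indexing that exploits inter-attribute relationships.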