In this paper, we study the problem of efficiently identifying the K records most similar to a given query record. Similarity is defined in two steps: (1) for each record, a similarity score between the record and the query record is computed over each individual attribute using a specific similarity function; (2) an aggregate function combines these per-attribute scores with weights, and the aggregated value serves as the similarity of the record. Once the similarities of all records have been computed, the K records with the greatest similarities can be identified. Under this framework, however, the computational cost becomes extremely high when the cardinality of the relation is large, because a similarity must be computed for every record. We therefore propose two efficient algorithms, ScanIndex and Top-Down (TD for short), to address this problem. ScanIndex avoids computing per-attribute similarity scores that are equal to zero. TD builds on ScanIndex and additionally skips per-attribute similarity scores that fall below thresholds (rather than only those equal to zero), where these thresholds are improved dynamically over time. Experimental results demonstrate that, compared with the naive approach, ScanIndex and TD can improve performance by two orders of magnitude.
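The framework described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the similarity function, the aggregate function, or the index layout, so the sketch assumes Jaccard similarity over whitespace-separated tokens, a weighted sum with positive weights as the aggregate, and a simple inverted index keyed by attribute position and token. All function names (`naive_top_k`, `build_index`, `indexed_top_k`) are hypothetical. The first function scores every record, as in the naive approach; the second only visits records that share at least one token with the query on some attribute, which loosely mirrors the idea of skipping per-attribute scores that are zero.

```python
from collections import defaultdict


def tokens(value):
    """Split an attribute value into a set of lowercase word tokens."""
    return set(value.lower().split())


def jaccard(a, b):
    """Jaccard similarity of two token sets (assumed similarity function)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def rank(scored, k):
    """Return the k highest (score, record id) pairs, ties broken by id."""
    return sorted(scored, key=lambda t: (-t[0], t[1]))[:k]


def naive_top_k(records, query, weights, k):
    """Naive approach: compute the weighted similarity of every record."""
    scored = []
    for rid, rec in enumerate(records):
        total = sum(
            w * jaccard(tokens(r), tokens(q))
            for r, q, w in zip(rec, query, weights)
        )
        scored.append((total, rid))
    return rank(scored, k)


def build_index(records):
    """Inverted index: (attribute position, token) -> set of record ids."""
    index = defaultdict(set)
    for rid, rec in enumerate(records):
        for pos, value in enumerate(rec):
            for tok in tokens(value):
                index[(pos, tok)].add(rid)
    return index


def indexed_top_k(records, index, query, weights, k):
    """Score only records with a non-zero similarity on some attribute.

    Records sharing no token with the query on an attribute have Jaccard
    similarity zero there, so that score is never computed.
    """
    totals = defaultdict(float)
    for pos, (q_value, w) in enumerate(zip(query, weights)):
        q_tok = tokens(q_value)
        candidates = set()
        for tok in q_tok:
            candidates |= index.get((pos, tok), set())
        for rid in candidates:
            totals[rid] += w * jaccard(tokens(records[rid][pos]), q_tok)
    return rank(((s, rid) for rid, s in totals.items()), k)


if __name__ == "__main__":
    # Made-up example data, for illustration only.
    records = [
        ("john smith", "new york"),
        ("jon smith", "new jersey"),
        ("mary jones", "los angeles"),
        ("john smyth", "york"),
    ]
    query = ("john smith", "new york")
    weights = (0.7, 0.3)
    index = build_index(records)
    print(naive_top_k(records, query, weights, 2))
    print(indexed_top_k(records, index, query, weights, 2))
```

Under these assumptions the two functions return the same ranking whenever at least K records have a non-zero similarity; the indexed variant can return fewer than K records otherwise, because records with an aggregate score of zero are never touched. The dynamic thresholds of TD are not sketched here, since the abstract does not describe how they are derived.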