Similarity join is a useful primitive operation underlying many applications, such as near-duplicate Web page detection, data integration, and pattern recognition. Traditional similarity joins require a user to specify a similarity threshold. In this paper, we study a variant of the similarity join, termed top-k set similarity join. It returns the top-k pairs of records ranked by their similarities, thus eliminating the guesswork users have to perform when the similarity threshold is unknown beforehand. An algorithm, topk-join, is proposed to answer top-k similarity join queries efficiently. It is based on the prefix filtering principle and employs tight upper bounding of the similarity values of unseen pairs. Experimental results demonstrate the efficiency of the proposed algorithm on large-scale real datasets.
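To make the query semantics concrete, the following is a minimal sketch of a top-k set similarity join. It is not the topk-join algorithm from the paper: it enumerates all pairs and uses only a simple size-based upper bound on similarity, where the paper relies on prefix filtering and tighter bounds. Jaccard similarity is assumed as the similarity function, and the names `jaccard` and `topk_join_naive` are hypothetical, chosen for illustration.

```python
import heapq
from itertools import combinations


def jaccard(x, y):
    """Jaccard similarity |x & y| / |x | y| of two sets (0.0 if both are empty)."""
    if not x and not y:
        return 0.0
    return len(x & y) / len(x | y)


def topk_join_naive(records, k):
    """Return the k most similar pairs as (similarity, i, j) tuples with i < j.

    Baseline sketch only: every pair is visited, and a pair is skipped when
    its size-based upper bound cannot beat the current k-th best similarity.
    """
    if k <= 0:
        return []
    heap = []  # min-heap of the best k pairs seen so far; heap[0] is the k-th best
    for i, j in combinations(range(len(records)), 2):
        x, y = records[i], records[j]
        # Jaccard(x, y) <= min(|x|, |y|) / max(|x|, |y|), so this bounds the pair.
        small, large = sorted((len(x), len(y)))
        bound = small / large if large else 0.0
        if len(heap) == k and bound <= heap[0][0]:
            continue  # cannot enter the current top-k
        sim = jaccard(x, y)
        if len(heap) < k:
            heapq.heappush(heap, (sim, i, j))
        elif sim > heap[0][0]:
            heapq.heapreplace(heap, (sim, i, j))
    # Highest similarity first; ties broken by record indices.
    return sorted(heap, key=lambda t: (-t[0], t[1], t[2]))


records = [{1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 3, 4, 5}, {7, 8}, {7, 9}]
result = topk_join_naive(records, 2)
```

No threshold is supplied by the caller: the k-th best similarity found so far acts as a rising threshold, which is the idea that a top-k join exploits to prune unseen pairs.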