Top-k Set Similarity Joins

Authors:
Chuan Xiao; Wei Wang; Xuemin Lin; Haichuan Shang
Venue:
ICDE '09 Proceedings of the 2009 IEEE International Conference on Data Engineering
Year:
2009

Citing 0
Cited 30

Efficient Set Similarity Joins Using Min-prefixes
ADBIS '09 Proceedings of the 13th East European Conference on Advances in Databases and Information Systems

Incremental all pairs similarity search for varying similarity thresholds
Proceedings of the 3rd Workshop on Social Network Mining and Analysis

Bed-tree: an all-purpose index structure for string similarity search based on edit distance
Proceedings of the 2010 ACM SIGMOD International Conference on Management of data

On indexing error-tolerant set containment
Proceedings of the 2010 ACM SIGMOD International Conference on Management of data

Efficient processing of exact top-k queries over disk-resident sorted lists
The VLDB Journal — The International Journal on Very Large Data Bases

Generalizing prefix filtering to improve set similarity joins
Information Systems

Approximate entity extraction in temporal databases
World Wide Web

Finding the k-closest pairs in metric spaces
Proceedings of the 1st Workshop on New Trends in Similarity Search

Approximate String Processing
Foundations and Trends in Databases

Faerie: efficient filtering algorithms for approximate dictionary-based entity extraction
Proceedings of the 2011 ACM SIGMOD International Conference on Management of data

Efficient similarity joins for near-duplicate detection
ACM Transactions on Database Systems (TODS)

Efficient fuzzy full-text type-ahead search
The VLDB Journal — The International Journal on Very Large Data Bases

Efficient duplicate detection on cloud using a new signature scheme
WAIM'11 Proceedings of the 12th international conference on Web-age information management

Efficient top-K approximate searches against a relation with multiple attributes
World Wide Web

Context-based entity description rule for entity resolution
Proceedings of the 20th ACM international conference on Information and knowledge management

Pass-join: a partition-based method for similarity joins
Proceedings of the VLDB Endowment

Efficient processing of probabilistic set-containment queries on uncertain set-valued data
Information Sciences: an International Journal

Can we beat the prefix filtering?: an adaptive framework for similarity join and search
SIGMOD '12 Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data

Relevance search in heterogeneous networks
Proceedings of the 15th International Conference on Extending Database Technology

CRSI: a compact randomized similarity index for set-valued features
Proceedings of the 15th International Conference on Extending Database Technology

Seal: spatio-textual similarity search
Proceedings of the VLDB Endowment

An optimized in-network aggregation scheme for data collection in periodic sensor networks
ADHOC-NOW'12 Proceedings of the 11th international conference on Ad-hoc, Mobile, and Wireless Networks

Detecting near-duplicate documents using sentence-level features and supervised learning
Expert Systems with Applications: An International Journal

Spatio-textual similarity joins
Proceedings of the VLDB Endowment

Efficient parallel partition-based algorithms for similarity search and join with edit distance constraints
Proceedings of the Joint EDBT/ICDT 2013 Workshops

A partition-based method for string similarity joins with edit-distance constraints
ACM Transactions on Database Systems (TODS)

Scalable k-nearest neighbor graph construction based on greedy filtering
Proceedings of the 22nd international conference on World Wide Web companion

Entity resolution on uncertain relations
WAIM'13 Proceedings of the 14th international conference on Web-Age Information Management

Extending string similarity join to tolerant fuzzy token matching
ACM Transactions on Database Systems (TODS)

Top-K structural diversity search in large networks
Proceedings of the VLDB Endowment

Abstract

Similarity join is a useful primitive operation underlying many applications, such as near-duplicate Web page detection, data integration, and pattern recognition. Traditional similarity joins require a user to specify a similarity threshold. In this paper, we study a variant of the similarity join, termed top-k set similarity join. It returns the top-k pairs of records ranked by their similarities, thus eliminating the guesswork users have to perform when the similarity threshold is unknown beforehand. An algorithm, topk-join, is proposed to answer top-k similarity joins efficiently. It is based on the prefix filtering principle and employs tight upper bounding of similarity values of unseen pairs. Experimental results demonstrate the efficiency of the proposed algorithm on large-scale real datasets.
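
To make the problem statement concrete, the following is a minimal brute-force sketch of a top-k set similarity join. It is not the topk-join algorithm from the paper: it scores every pair of records and applies neither prefix filtering nor upper bounding of unseen pairs. The abstract does not name a similarity function, so Jaccard similarity is assumed here, and the names `jaccard` and `naive_topk_join` are hypothetical, chosen only for illustration.

```python
import heapq
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two sets (0.0 if both are empty)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def naive_topk_join(records, k):
    """Return the k most similar record pairs as (similarity, i, j), best first.

    Brute-force baseline (an assumption for illustration, not the paper's
    method): every pair is scored, so the cost is O(n^2) comparisons.
    """
    if k <= 0:
        return []
    heap = []  # min-heap holding the best k pairs seen so far
    for (i, a), (j, b) in combinations(enumerate(records), 2):
        item = (jaccard(a, b), i, j)
        if len(heap) < k:
            heapq.heappush(heap, item)
        elif item > heap[0]:
            # The new pair beats the weakest pair currently kept.
            heapq.heapreplace(heap, item)
    return sorted(heap, reverse=True)


# Small made-up example: records are token sets, no threshold is supplied.
records = [
    {"a", "b", "c", "d"},
    {"a", "b", "c", "e"},
    {"a", "b", "c", "d", "e"},
    {"x", "y", "z"},
]
top2 = naive_topk_join(records, 2)
```

The caller supplies only `k`, never a similarity threshold, which is the point of the top-k formulation. The weakest similarity in the heap acts as a threshold that rises as better pairs are found; an efficient algorithm would use such a bound to skip pairs, whereas this sketch still examines all of them.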