SIGMOD '95 Proceedings of the 1995 ACM SIGMOD international conference on Management of data
SIGMOD '98 Proceedings of the 1998 ACM SIGMOD international conference on Management of data
ACM Computing Surveys (CSUR)
Similarity estimation techniques from rounding algorithms
STOC '02 Proceedings of the thiry-fourth annual ACM symposium on Theory of computing
Finding Interesting Associations without Support Pruning
IEEE Transactions on Knowledge and Data Engineering
A Comparative Study on Feature Selection in Text Categorization
ICML '97 Proceedings of the Fourteenth International Conference on Machine Learning
VLDB '98 Proceedings of the 24rd International Conference on Very Large Data Bases
Similarity Search in High Dimensions via Hashing
VLDB '99 Proceedings of the 25th International Conference on Very Large Data Bases
Approximate String Joins in a Database (Almost) for Free
Proceedings of the 27th International Conference on Very Large Data Bases
Efficient Record Linkage in Large Data Sets
DASFAA '03 Proceedings of the Eighth International Conference on Database Systems for Advanced Applications
On the Resemblance and Containment of Documents
SEQUENCES '97 Proceedings of the Compression and Complexity of Sequences 1997
Efficient processing of joins on set-valued attributes
Proceedings of the 2003 ACM SIGMOD international conference on Management of data
Efficient set joins on similarity predicates
SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
Efficient randomized pattern-matching algorithms
IBM Journal of Research and Development - Mathematics and computing
Evaluating similarity measures: a large-scale study in the orkut social network
Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining
A Primitive Operator for Similarity Joins in Data Cleaning
ICDE '06 Proceedings of the 22nd International Conference on Data Engineering
Integrating XML data sources using approximate joins
ACM Transactions on Database Systems (TODS)
Efficient exact set-similarity joins
VLDB '06 Proceedings of the 32nd international conference on Very large data bases
ACM Transactions on Database Systems (TODS)
Scaling up all pairs similarity search
Proceedings of the 16th international conference on World Wide Web
Benchmarking declarative approximate selection predicates
Proceedings of the 2007 ACM SIGMOD international conference on Management of data
ACM Transactions on Database Systems (TODS)
Efficient similarity joins for near duplicate detection
Proceedings of the 17th international conference on World Wide Web
An efficient filter for approximate membership checking
Proceedings of the 2008 ACM SIGMOD international conference on Management of data
SpotSigs: robust and efficient near duplicate detection in large web collections
Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
Ed-Join: an efficient algorithm for similarity joins with edit distance constraints
Proceedings of the VLDB Endowment
Transformation-based Framework for Record Matching
ICDE '08 Proceedings of the 2008 IEEE 24th International Conference on Data Engineering
Efficient Merging and Filtering Algorithms for Approximate String Searches
ICDE '08 Proceedings of the 2008 IEEE 24th International Conference on Data Engineering
Fast Indexes and Algorithms for Set Similarity Selection Queries
ICDE '08 Proceedings of the 2008 IEEE 24th International Conference on Data Engineering
ICDE '08 Proceedings of the 2008 IEEE 24th International Conference on Data Engineering
A Fast Similarity Join Algorithm Using Graphics Processing Units
ICDE '08 Proceedings of the 2008 IEEE 24th International Conference on Data Engineering
ICDE '09 Proceedings of the 2009 IEEE International Conference on Data Engineering
Efficient Set Similarity Joins Using Min-prefixes
ADBIS '09 Proceedings of the 13th East European Conference on Advances in Databases and Information Systems
Power-Law Distributions in Empirical Data
SIAM Review
Power-law based estimation of set similarity join size
Proceedings of the VLDB Endowment
PG-join: proximity graph based string similarity joins
SSDBM'11 Proceedings of the 23rd international conference on Scientific and statistical database management
Ingredients for accurate, fast, and robust XML similarity joins
DEXA'11 Proceedings of the 22nd international conference on Database and expert systems applications - Volume Part II
Leveraging the storage layer to support XML similarity joins in XDBMSs
ADBIS'12 Proceedings of the 16th East European conference on Advances in Databases and Information Systems
Hi-index | 0.00 |
Identification of all pairs of objects in a dataset whose similarity is not less than a specified threshold is of major importance for management, search, and analysis of data. Set similarity joins are commonly used to implement this operation; they scale to large datasets and are versatile to represent a variety of similarity notions. Most methods proposed so far present two main phases at a high level of abstraction: candidate generation producing a set of candidate pairs and verification applying the actual similarity measure to the candidates and returning the correct answer. Previous work has primarily focused on the reduction of candidates, where candidate generation presented the major effort to obtain better pruning results. Here, we propose an opposite approach. We drastically decrease the computational cost of candidate generation by dynamically reducing the number of indexed objects at the expense of increasing the workload of the verification phase. Our experimental findings show that this trade-off is advantageous: we consistently achieve substantial speed-ups as compared to known algorithms.