A growing number of database applications require sophisticated approximate string matching capabilities, notably in data integration and data cleaning. Cosine similarity has proven to be a robust metric for scoring the similarity between two strings, and it is increasingly used in complex queries. An immediate challenge for current database optimizers is to find accurate and efficient methods for estimating the selectivity of cosine similarity predicates. To the best of our knowledge, no methods are known for this problem. In this paper, we present the first approach for estimating the selectivity of tf.idf-based cosine similarity predicates. We evaluate our approach on three real datasets and show that our method often produces estimates that are within 40% of the actual selectivity.
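To make the quantity being estimated concrete, the following minimal sketch computes the exact selectivity of a tf.idf-based cosine similarity predicate by brute force, that is, the fraction of strings in a collection whose similarity to a query string meets a threshold. It is not the estimation method of the paper, only the ground truth such an estimator would try to approximate. The q-gram tokenization, the padding characters, the idf formula `log(1 + n / df)`, and all function names are assumptions made for illustration.

```python
import math
from collections import Counter


def qgrams(s, q=3):
    """Split a string into overlapping q-grams (assumed tokenization)."""
    padded = "#" * (q - 1) + s.lower() + "$" * (q - 1)
    return [padded[i:i + q] for i in range(len(padded) - q + 1)]


def build_idf(strings, q=3):
    """Compute an idf weight per q-gram; log(1 + n/df) is an assumed variant."""
    n = len(strings)
    df = Counter()
    for s in strings:
        df.update(set(qgrams(s, q)))  # count each q-gram once per string
    return {g: math.log(1 + n / d) for g, d in df.items()}


def tfidf_vector(s, idf, q=3):
    """Return the unit-length tf.idf vector of a string as a sparse dict."""
    tf = Counter(qgrams(s, q))
    # q-grams never seen in the collection carry no weight here
    vec = {g: c * idf[g] for g, c in tf.items() if g in idf}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {g: w / norm for g, w in vec.items()} if norm else {}


def cosine(u, v):
    """Dot product of two unit-length sparse vectors."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(g, 0.0) for g, w in u.items())


def exact_selectivity(query, strings, threshold, q=3):
    """Fraction of strings with cosine(query, string) >= threshold."""
    idf = build_idf(strings, q)
    qv = tfidf_vector(query, idf, q)
    hits = sum(
        1 for s in strings
        if cosine(qv, tfidf_vector(s, idf, q)) >= threshold
    )
    return hits / len(strings)


if __name__ == "__main__":
    names = ["Microsoft Corp", "Microsoft Corporation", "Apple Inc", "Oracle Corp"]
    for t in (0.3, 0.6, 0.9):
        print(t, exact_selectivity("Microsoft Corp", names, t))
```

This scan touches every string in the collection, which is exactly the cost a query optimizer cannot afford at planning time; a selectivity estimator aims to approximate the returned fraction without the full scan.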