Estimating the selectivity of tf-idf based cosine similarity predicates

Authors:
Sandeep Tata;Jignesh M. Patel
Affiliations:
University of Michigan, Ann Arbor, Michigan;University of Michigan, Ann Arbor, Michigan
Venue:
ACM SIGMOD Record
Year:
2007

Abstract

An increasing number of database applications today require sophisticated approximate string matching capabilities. Examples of such application areas include data integration and data cleaning. Cosine similarity has proven to be a robust metric for scoring the similarity between two strings, and it is increasingly being used in complex queries. An immediate challenge faced by current database optimizers is to find accurate and efficient methods for estimating the selectivity of cosine similarity predicates. To the best of our knowledge, there are no known methods for this problem. In this paper, we present the first approach for estimating the selectivity of tf.idf based cosine similarity predicates. We evaluate our approach on three different real datasets and show that our method often produces estimates that are within 40% of the actual selectivity.
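
To make the quantity being estimated concrete, the following is a minimal sketch of how the exact selectivity of a tf.idf cosine similarity predicate could be computed by brute force. It is not the estimation method of the paper, which the abstract does not describe. The whitespace tokenizer, the idf formula log(1 + N/df), and all function names here are assumptions chosen for illustration.

```python
import math
from collections import Counter


def build_idf(docs):
    """Compute idf weights over whitespace tokens (assumed formula: log(1 + N/df))."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    df = Counter(tok for toks in tokenized for tok in set(toks))
    return {tok: math.log(1 + n / count) for tok, count in df.items()}


def tfidf_vector(text, idf):
    """Return an L2-normalized sparse tf.idf vector (token -> weight).

    Tokens absent from the idf table are ignored.
    """
    tf = Counter(text.lower().split())
    vec = {tok: count * idf[tok] for tok, count in tf.items() if tok in idf}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    if norm == 0.0:
        return {}
    return {tok: w / norm for tok, w in vec.items()}


def cosine(u, v):
    """Cosine similarity of two unit-length sparse vectors (their dot product)."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(tok, 0.0) for tok, w in u.items())


def exact_selectivity(query, docs, threshold):
    """Fraction of strings in docs whose cosine similarity to query is >= threshold."""
    if not docs:
        return 0.0
    idf = build_idf(docs)
    q = tfidf_vector(query, idf)
    matches = sum(1 for doc in docs if cosine(q, tfidf_vector(doc, idf)) >= threshold)
    return matches / len(docs)


# Hypothetical data for illustration only.
names = [
    "university of michigan",
    "university of michigan ann arbor",
    "michigan state university",
    "stanford university",
]
selectivity = exact_selectivity("university of michigan", names, 0.5)
```

This brute-force scan touches every string in the relation, which is the cost a query optimizer cannot afford at planning time. An estimator of the kind the abstract describes would aim to approximate the same fraction without performing the scan.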