SEPIA: estimating selectivities of approximate string predicates in large Databases
The VLDB Journal — The International Journal on Very Large Data Bases
Many database applications have the emerging need to support fuzzy queries that ask for strings that are similar to a given string, such as "name similar to smith" and "telephone number similar to 412-0964." Query optimization needs the selectivity of such a fuzzy predicate, i.e., the fraction of records in the database that satisfy the condition. In this paper, we study the problem of estimating selectivities of fuzzy string predicates. We develop a novel technique, called SEPIA, to solve the problem. It groups strings into clusters, builds a histogram structure for each cluster, and constructs a global histogram for the database. It is based on the following intuition: given a query string q, a preselected string p in a cluster, and a string s in the cluster, based on the proximity between q and p, and the proximity between p and s, we can obtain a probability distribution from a global histogram about the similarity between q and s. We give a full specification of the technique using the edit distance function. We study challenges in adopting this technique, including how to construct the histogram structures, how to use them to do selectivity estimation, and how to alleviate the effect of non-uniform errors in the estimation. We discuss how to extend the techniques to other similarity functions. Our extensive experiments on real data sets show that this technique can accurately estimate selectivities of fuzzy string predicates.
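To make the quantities in the abstract concrete, the following is a minimal Python sketch, not the SEPIA technique itself. It shows (1) the edit distance function, (2) the exact selectivity of a fuzzy predicate computed by a full scan, which is the value an estimator tries to approximate without scanning, and (3) the triangle-inequality bounds relating a query string q, a cluster's preselected string p, and a string s in the cluster. The function names `edit_distance`, `exact_selectivity`, and `pivot_bounds` are illustrative assumptions and do not come from the paper; SEPIA replaces the coarse bounds in (3) with a probability distribution drawn from its global histogram.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def exact_selectivity(records, query, k):
    """Fraction of records within edit distance k of query (full scan).

    Hypothetical baseline: this is the ground truth that a selectivity
    estimator approximates, not an estimator in its own right.
    """
    if not records:
        return 0.0
    hits = sum(1 for s in records if edit_distance(query, s) <= k)
    return hits / len(records)


def pivot_bounds(d_qp, d_ps):
    """Triangle-inequality bounds on d(q, s) given d(q, p) and d(p, s).

    Edit distance is a metric, so |d(q,p) - d(p,s)| <= d(q,s) <= d(q,p) + d(p,s).
    """
    return abs(d_qp - d_ps), d_qp + d_ps


if __name__ == "__main__":
    names = ["smith", "smyth", "smithe", "jones"]
    print(exact_selectivity(names, "smith", 1))  # 0.75
    print(pivot_bounds(1, 3))                    # (2, 4)
```

The full scan costs one edit-distance computation per record, which is what a query optimizer cannot afford at planning time. The bounds show why the proximities between q and p and between p and s carry information about the similarity between q and s, which is the intuition the abstract describes.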