Prior work has identified set-based comparisons as a useful primitive for supporting a wide variety of similarity functions in record matching. Accordingly, various techniques have been proposed to improve the performance of set similarity lookups. However, this body of work focuses almost exclusively on symmetric notions of set similarity. In this paper, we study the indexing problem for the asymmetric Jaccard containment similarity function, an error-tolerant variation of set containment. We extend this similarity function to account for string transformations that capture synonyms, such as "Bob" and "Robert" referring to the same first name. We propose an index structure that builds inverted lists on carefully chosen token sets, together with a lookup algorithm over this index that is sensitive to the output size of the query. Our experiments over real-life data sets show the benefits of our techniques. To our knowledge, this is the first paper to study the indexing problem for Jaccard containment in the presence of string transformations.
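To make the asymmetry concrete, the following is a minimal sketch of an unweighted Jaccard containment score, assuming it is defined as the fraction of query tokens that also occur in the record. The function names, the synonym table, and the rule that a token counts as matched when one of its synonyms occurs in the record are illustrative assumptions, not the definitions or the index structure used in the paper.

```python
from typing import Dict, Set


def jaccard_containment(query: Set[str], record: Set[str]) -> float:
    """Fraction of query tokens that also appear in record: |q & r| / |q|.

    Assumption: unweighted tokens; an empty query scores 0.0 by convention.
    """
    if not query:
        return 0.0
    return len(query & record) / len(query)


def containment_with_synonyms(
    query: Set[str], record: Set[str], synonyms: Dict[str, Set[str]]
) -> float:
    """Hypothetical transformation-aware variant.

    A query token counts as matched if the token itself, or any synonym
    listed for it in the (assumed) synonym table, occurs in the record.
    """
    if not query:
        return 0.0
    matched = sum(
        1
        for token in query
        if token in record
        or any(alt in record for alt in synonyms.get(token, set()))
    )
    return matched / len(query)


# Illustrative data only.
query = {"bob", "smith"}
record = {"robert", "smith", "seattle", "wa"}
synonyms = {"bob": {"robert"}}

plain = jaccard_containment(query, record)        # 1 of 2 query tokens found
reverse = jaccard_containment(record, query)      # 1 of 4 record tokens found
with_syn = containment_with_synonyms(query, record, synonyms)
```

Swapping the arguments changes the score (0.5 versus 0.25 here), which is what distinguishes containment from symmetric measures such as Jaccard similarity. Under the assumed synonym rule, "bob" matches "robert" and the containment of the query in the record rises to 1.0.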