On indexing error-tolerant set containment

Authors:
Parag Agrawal;Arvind Arasu;Raghav Kaushik
Affiliations:
Stanford University, Palo Alto, CA, USA;Microsoft Research, Redmond, WA, USA;Microsoft Research, Redmond, WA, USA
Venue:
Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Year:
2010

Citing 29
Cited 5

Mining association rules between sets of items in large databases

SIGMOD '93 Proceedings of the 1993 ACM SIGMOD international conference on Management of data
A new method for similarity indexing of market basket data

SIGMOD '99 Proceedings of the 1999 ACM SIGMOD international conference on Management of data
Modern Information Retrieval

Modern Information Retrieval
Similarity Search in High Dimensions via Hashing

VLDB '99 Proceedings of the 25th International Conference on Very Large Data Bases
Set Containment Joins: The Good, The Bad and The Ugly

VLDB '00 Proceedings of the 26th International Conference on Very Large Data Bases
Approximate String Joins in a Database (Almost) for Free

Proceedings of the 27th International Conference on Very Large Data Bases
Fast Algorithms for Mining Association Rules in Large Databases

VLDB '94 Proceedings of the 20th International Conference on Very Large Data Bases
Adaptive algorithms for set containment joins

ACM Transactions on Database Systems (TODS)
Discovering all most specific sentences

ACM Transactions on Database Systems (TODS)
On the Resemblance and Containment of Documents

SEQUENCES '97 Proceedings of the Compression and Complexity of Sequences 1997
Efficient processing of joins on set-valued attributes

Proceedings of the 2003 ACM SIGMOD international conference on Management of data
Efficient set joins on similarity predicates

SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
Advances in frequent itemset mining implementations: report on FIMI'03

ACM SIGKDD Explorations Newsletter - Special issue on learning from imbalanced datasets
A Heterogeneous Field Matching Method for Record Linkage

ICDM '05 Proceedings of the Fifth IEEE International Conference on Data Mining
A Primitive Operator for Similarity Joins in Data Cleaning

ICDE '06 Proceedings of the 22nd International Conference on Data Engineering
Record linkage: similarity measures and algorithms

Proceedings of the 2006 ACM SIGMOD international conference on Management of data
Efficient exact set-similarity joins

VLDB '06 Proceedings of the 32nd international conference on Very large data bases
Duplicate Record Detection: A Survey

IEEE Transactions on Knowledge and Data Engineering
Heavy-tailed distributions and multi-keyword queries

SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
Challenges in building large-scale information retrieval systems: invited talk

Proceedings of the Second ACM International Conference on Web Search and Data Mining
What Sperner Family Concept Class is Easy to Be Enumerated?

ICDM '08 Proceedings of the 2008 Eighth IEEE International Conference on Data Mining
Efficient interactive fuzzy keyword search

Proceedings of the 18th international conference on World wide web
Transformation-based Framework for Record Matching

ICDE '08 Proceedings of the 2008 IEEE 24th International Conference on Data Engineering
Fast Indexes and Algorithms for Set Similarity Selection Queries

ICDE '08 Proceedings of the 2008 IEEE 24th International Conference on Data Engineering
Top-k Set Similarity Joins

ICDE '09 Proceedings of the 2009 IEEE International Conference on Data Engineering
Efficient type-ahead search on relational data: a TASTIER approach

Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
Efficient approximate search on string collections

Proceedings of the VLDB Endowment
Mining document collections to facilitate accurate approximate entity matching

Proceedings of the VLDB Endowment
Learning string transformations from examples

Proceedings of the VLDB Endowment

Efficient answering of set containment queries for skewed item distributions

Proceedings of the 14th International Conference on Extending Database Technology
Efficient processing of probabilistic set-containment queries on uncertain set-valued data

Information Sciences: an International Journal
Filtering and ranking schemes for finding inclusion dependencies on the web

Proceedings of the 21st international conference companion on World Wide Web
Efficient filtering and ranking schemes for finding inclusion dependencies on the web

Proceedings of the 22nd ACM international conference on Conference on information & knowledge management
Efficient parsing-based search over structured data

Proceedings of the 22nd ACM international conference on Conference on information & knowledge management

Quantified Score

Hi-index	0.00

Visualization

Abstract

Prior work has identified set based comparisons as a useful primitive for supporting a wide variety of similarity functions in record matching. Accordingly, various techniques have been proposed to improve the performance of set similarity lookups. However, this body of work focuses almost exclusively on symmetric notions of set similarity. In this paper, we study the indexing problem for the asymmetric Jaccard containment similarity function that is an error-tolerant variation of set containment. We enhance this similarity function to also account for string transformations that reflect synonyms such as "Bob" and "Robert" referring to the same first name. We propose an index structure that builds inverted lists on carefully chosen token-sets and a lookup algorithm using our index that is sensitive to the output size of the query. Our experiments over real life data sets show the benefits of our techniques. To our knowledge, this is the first paper that studies the indexing problem for Jaccard containment in the presence of string transformations.