Fast Indexes and Algorithms for Set Similarity Selection Queries

Authors:
Marios Hadjieleftheriou; Amit Chandel; Nick Koudas; Divesh Srivastava
Affiliations:
AT&T Labs-Research, Florham Park, NJ 07932, USA. marioh@research.att.com; Department of Computer Science, University of Toronto, Toronto, ON M5S 2E4, Canada. amit@cs.toronto.edu; Department of Computer Science, University of Toronto, Toronto, ON M5S 2E4, Canada. koudas@cs.toronto.edu; AT&T Labs-Research, Florham Park, NJ 07932, USA. divesh@research.att.com
Venue:
ICDE '08 Proceedings of the 2008 IEEE 24th International Conference on Data Engineering
Year:
2008

Abstract

Data collections often contain inconsistencies that arise for a variety of reasons, and it is desirable to identify and resolve them efficiently. Set similarity queries are commonly used in data cleaning to match similar data. In this work we concentrate on set similarity selection queries: given a query set, retrieve all sets in a collection whose similarity to the query is greater than some threshold. Various set similarity measures have been proposed in the past for data cleaning purposes. We focus on weighted similarity functions such as TF/IDF, and introduce variants that are well suited for set similarity selections in a relational database context. These variants have special semantic properties that can be exploited to design efficient index structures and query-answering algorithms. We present modifications of existing technologies to work for set similarity selection queries. We also introduce three novel algorithms based on the Threshold Algorithm that exploit the semantic properties of the new similarity measures to achieve the best performance in theory and practice.
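
To make the query type concrete, the following is a minimal Python sketch of a set similarity selection over an inverted index. It is not one of the paper's algorithms: it is a simple baseline that scores every set sharing at least one token with the query. The weighting `idf(t) = log(1 + N / df(t))`, the cosine-style normalization, and all function names (`idf_weights`, `build_index`, `select`) are assumptions chosen for illustration, since the abstract does not define the exact measure.

```python
import math
from collections import defaultdict


def idf_weights(collection):
    """Assumed weighting: idf(t) = log(1 + N / df(t)), df = number of sets containing t."""
    n = len(collection)
    df = defaultdict(int)
    for s in collection:
        for token in s:
            df[token] += 1
    return {token: math.log(1 + n / count) for token, count in df.items()}


def norm(s, idf):
    """Weighted length of a set; tokens unseen in the collection contribute zero."""
    return math.sqrt(sum(idf.get(token, 0.0) ** 2 for token in s))


def similarity(q, s, idf):
    """Assumed measure: IDF-weighted, length-normalized overlap in [0, 1]."""
    denom = norm(q, idf) * norm(s, idf)
    if denom == 0:
        return 0.0
    return sum(idf.get(token, 0.0) ** 2 for token in q & s) / denom


def build_index(collection):
    """Inverted index: token -> list of ids of the sets that contain it."""
    index = defaultdict(list)
    for sid, s in enumerate(collection):
        for token in s:
            index[token].append(sid)
    return index


def select(query, collection, index, idf, threshold):
    """Return (set id, similarity) for every set with similarity > threshold."""
    candidates = set()
    for token in query:
        candidates.update(index.get(token, ()))
    scored = ((sid, similarity(query, collection[sid], idf)) for sid in sorted(candidates))
    return [(sid, sim) for sid, sim in scored if sim > threshold]


if __name__ == "__main__":
    # Hypothetical toy data: each string has been tokenized into a set of words.
    data = [frozenset({"main", "st"}), frozenset({"main", "street"}), frozenset({"park", "ave"})]
    weights = idf_weights(data)
    inverted = build_index(data)
    for sid, sim in select({"main", "st"}, data, inverted, weights, threshold=0.5):
        print(sid, round(sim, 3))
```

Only sets that share a token with the query are scored, which is the basic saving an inverted index provides. Threshold Algorithm variants such as those the abstract mentions would aim to avoid scanning the full lists as well; that part is not shown here.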