Incremental maintenance of length normalized indexes for approximate string matching

Authors:
Marios Hadjieleftheriou;Nick Koudas;Divesh Srivastava
Affiliations:
AT&T Labs - Research, Florham Park, NJ, USA;University of Toronto, Toronto, ON, Canada;AT&T Labs - Research, Florham Park, NJ, USA
Venue:
Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
Year:
2009

Abstract

Approximate string matching has received considerable attention in recent years. Existing work on information retrieval has concentrated on a variety of similarity measures (TF/IDF, BM25, HMM, etc.) specifically tailored for document retrieval. As applications that depend on retrieving short strings become popular (e.g., local search engines like YellowPages.com, Yahoo! Local, and Google Maps), new indexing methods tailored for short strings are needed. For that purpose, a number of indexing techniques and related algorithms have been proposed based on length-normalized similarity measures. A common denominator of indexes for length-normalized measures is that maintaining the underlying structures in the presence of incremental updates is inefficient, mainly due to the data-dependent, precomputed weights associated with each distinct token and string. Updates are usually incorporated by rebuilding the indexes at regular time intervals. In this paper we present a framework that advocates lazy update propagation with the following key feature: efficient, incremental updates that immediately reflect the new data in the indexes in a way that gives strict guarantees on the quality of subsequent query answers. More specifically, our techniques guarantee against false negatives and limit the number of false positives produced. We implement a fully working prototype and show that the proposed ideas work well in practice on real datasets.
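
The abstract gives no formulas, so the following is only a minimal sketch of why data-dependent weights make incremental updates expensive. It assumes 3-gram tokens, an idf weight of the form log2(1 + N / df(t)), and an L2 normalized length; these choices and all function names are illustrative assumptions, not details taken from the abstract. The helper `drifted_tokens` hints at how a lazy policy might postpone recomputation until a weight has moved by more than a tolerance. It is not the paper's propagation algorithm and carries none of its guarantees.

```python
import math
from collections import Counter


def qgrams(s, q=3):
    """Split a string into its set of overlapping q-grams (assumed tokenization)."""
    padded = "#" * (q - 1) + s.lower() + "#" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def idf_weights(strings):
    """Assumed weight idf(t) = log2(1 + N / df(t)); it depends on the whole collection."""
    n = len(strings)
    df = Counter(t for s in strings for t in qgrams(s))
    return {t: math.log2(1 + n / c) for t, c in df.items()}


def norm_length(s, idf):
    """L2 length of a string under the given token weights."""
    return math.sqrt(sum(idf[t] ** 2 for t in qgrams(s)))


def drifted_tokens(stale, fresh, tolerance):
    """Tokens whose fresh weight differs from the stored one by more than
    `tolerance` in relative terms (hypothetical laziness criterion)."""
    return {t for t, w in stale.items() if abs(fresh[t] - w) / w > tolerance}


data = ["main street", "maine street", "park avenue"]
stale = idf_weights(data)                # weights stored in the index
fresh = idf_weights(data + ["main st"])  # weights after one insertion

# Stored strings whose precomputed length is now out of date.
changed = [s for s in data
           if not math.isclose(norm_length(s, stale), norm_length(s, fresh))]
print(f"{len(changed)} of {len(data)} stored lengths changed")

# Eager maintenance would rewrite every drifted weight; a lazy policy
# would only touch those beyond the tolerance.
print(len(drifted_tokens(stale, fresh, 0.0)), "tokens drifted at all")
print(len(drifted_tokens(stale, fresh, 0.1)), "tokens drifted by more than 10%")
```

In this toy collection a single insertion changes N, so the weight of every token shifts, and with it the stored length of every string, including "park avenue", which shares no token with the new string. That is the cost that periodic rebuilding avoids and that a lazy scheme tries to defer.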