Approximate String Processing

Authors:
Marios Hadjieleftheriou; Divesh Srivastava
Affiliations:
-;-
Venue:
Foundations and Trends in Databases
Year:
2011

Abstract

One of the most important primitive data types in modern data processing is text. Text data are known to have a variety of inconsistencies (e.g., spelling mistakes and representational variations). For that reason, there exists a large body of literature on approximate processing of text. This monograph focuses specifically on the problem of approximate string matching: given a set of strings S and a query string v, the goal is to find all strings s ∈ S that have a user-specified degree of similarity to v. Set S could be, for example, a corpus of documents, a set of web pages, or an attribute of a relational table. The similarity between strings is always defined with respect to a similarity function that is chosen based on the characteristics of the data and the application at hand. This work presents a survey of indexing techniques and algorithms specifically designed for approximate string matching. We concentrate on inverted indexes, filtering techniques, and tree data structures that can be used to evaluate a variety of set-based and edit-based similarity functions. We focus on all-match and top-k flavors of selection and join queries, and discuss the applicability, advantages, and disadvantages of each technique for every query type.
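
To make the problem statement concrete, the following is a minimal Python sketch of an all-match selection query under an edit-distance threshold. It is an illustration under stated assumptions, not the monograph's own algorithm: it combines a q-gram inverted index with the standard length and count filters, then verifies the surviving candidates with a dynamic-programming edit distance. The function names (`qgrams`, `build_index`, `search`), the padding characters `#` and `$`, and the default `q=2` are all hypothetical choices made for this example.

```python
from collections import defaultdict


def qgrams(s, q=2):
    """Overlapping q-grams of s, padded with q-1 sentinel characters per side."""
    padded = "#" * (q - 1) + s + "$" * (q - 1)
    return [padded[i:i + q] for i in range(len(padded) - q + 1)]


def edit_distance(a, b):
    """Levenshtein distance via the classic row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def build_index(strings, q=2):
    """Inverted index: q-gram -> ids of strings containing it (one entry per occurrence)."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        for g in qgrams(s, q):
            index[g].append(sid)
    return index


def search(strings, index, v, k, q=2):
    """Return every s in strings with edit_distance(s, v) <= k."""
    # Count q-gram occurrences shared with the query by scanning its posting lists.
    # Repeated q-grams can inflate a count, which only weakens the filter.
    counts = defaultdict(int)
    for g in qgrams(v, q):
        for sid in index.get(g, ()):
            counts[sid] += 1

    results = []
    for sid, s in enumerate(strings):
        # Length filter: k edits change the length by at most k.
        if abs(len(s) - len(v)) > k:
            continue
        # Count filter: strings within k edits share at least this many padded q-grams.
        # A non-positive threshold prunes nothing, so the string goes to verification.
        threshold = max(len(s), len(v)) + q - 1 - k * q
        if counts.get(sid, 0) < threshold:
            continue
        # Verification: the filters admit false positives but no false negatives.
        if edit_distance(s, v) <= k:
            results.append(s)
    return results


S = ["approximate", "appropriate", "proximate", "string", "strong", "matching"]
idx = build_index(S)
print(search(S, idx, "aproximate", k=1))  # ['approximate', 'proximate']
```

The final loop visits every string id for simplicity. A production index would instead merge the posting lists and handle the non-positive-threshold case separately, for example with a length-partitioned scan.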