Simple and efficient algorithm for approximate dictionary matching

Authors:
Naoaki Okazaki; Jun'ichi Tsujii
Affiliations:
University of Tokyo; University of Tokyo and University of Manchester
Venue:
COLING '10 Proceedings of the 23rd International Conference on Computational Linguistics
Year:
2010

Abstract

This paper presents a simple and efficient algorithm for approximate dictionary matching, designed for similarity measures such as the cosine, Dice, Jaccard, and overlap coefficients. We call the algorithm CPMerge; it computes the τ-overlap join of inverted lists. We first show that approximate dictionary matching can be solved exactly as a τ-overlap join. Given the inverted lists retrieved for a query, the algorithm collects a reduced set of candidate strings and prunes unlikely candidates, efficiently finding the strings that satisfy the constraint of the τ-overlap join. We conducted approximate dictionary matching experiments on three large-scale datasets containing person names, biomedical names, and general English words. The algorithm exhibited scalable performance on the datasets. For example, it retrieved strings in 1.1 ms from the string collection of Google Web1T unigrams (with cosine similarity and threshold 0.7).
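
To make the idea of a τ-overlap join over inverted lists concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation. Strings are represented as sets of padded character trigrams (an assumption; the abstract does not fix the feature type), and the names `features`, `build_index`, `overlap_join`, and `search` are hypothetical. For cosine similarity with threshold α, the size bounds on a matching string Y and the required overlap τ follow directly from the definition |X∩Y| / sqrt(|X||Y|) ≥ α. The candidate-then-prune structure mirrors the abstract's description: candidates are gathered only from the shortest lists, then checked against the longer lists by binary search and dropped once they can no longer reach τ.

```python
import math
from bisect import bisect_left
from collections import defaultdict

EPS = 1e-9  # guards ceil/floor against floating-point noise


def features(s, n=3):
    """Set of character n-grams of s, padded with '$' (assumed feature type)."""
    padded = "$" * (n - 1) + s + "$" * (n - 1)
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def build_index(strings, n=3):
    """Map (feature-set size, n-gram) -> sorted list of string ids."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        feats = features(s, n)
        for g in feats:
            index[(len(feats), g)].append(sid)  # ids are appended in increasing order
    return index


def overlap_join(lists, tau):
    """Return ids that occur in at least tau of the given sorted id lists."""
    lists = sorted(lists, key=len)
    n = len(lists)
    if tau > n:
        return []
    # Phase 1: an id present in >= tau lists must appear in at least one of
    # the n - tau + 1 shortest lists, so only those are scanned for candidates.
    counts = defaultdict(int)
    for lst in lists[:n - tau + 1]:
        for sid in lst:
            counts[sid] += 1
    results = [sid for sid, c in counts.items() if c >= tau]
    pending = {sid: c for sid, c in counts.items() if c < tau}
    # Phase 2: probe the longer lists by binary search; prune a candidate as
    # soon as the remaining lists cannot lift its count to tau.
    for k in range(n - tau + 1, n):
        lst, remaining = lists[k], n - k - 1
        survivors = {}
        for sid, c in pending.items():
            pos = bisect_left(lst, sid)
            if pos < len(lst) and lst[pos] == sid:
                c += 1
            if c >= tau:
                results.append(sid)
            elif c + remaining >= tau:
                survivors[sid] = c
        pending = survivors
    return results


def search(query, strings, index, alpha, n=3):
    """Strings whose cosine similarity to query is at least alpha (0 < alpha <= 1)."""
    x = features(query, n)
    lo = math.ceil(alpha * alpha * len(x) - EPS)       # smallest feasible |Y|
    hi = math.floor(len(x) / (alpha * alpha) + EPS)    # largest feasible |Y|
    found = []
    for size in range(lo, hi + 1):
        tau = math.ceil(alpha * math.sqrt(len(x) * size) - EPS)
        lists = [index.get((size, g), []) for g in x]
        found.extend(strings[sid] for sid in overlap_join(lists, tau))
    return found


if __name__ == "__main__":
    words = ["approximate", "appropriate", "dictionary", "matching", "approximation"]
    idx = build_index(words)
    print(search("aproximate", words, idx, alpha=0.7))
```

Keying the index by feature-set size lets each candidate size be joined with its own τ, so every returned string satisfies the threshold exactly and no separate verification pass is needed in this sketch.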