This paper presents a simple and efficient algorithm for approximate dictionary matching designed for similarity measures such as the cosine, Dice, Jaccard, and overlap coefficients. We first show that approximate dictionary matching under these measures can be solved exactly by a τ-overlap join of inverted lists, and we then propose an algorithm for this join, called CPMerge. Given the inverted lists retrieved for a query, the algorithm collects a small set of candidate strings and prunes unlikely candidates to efficiently find the strings that satisfy the constraint of the τ-overlap join. We conducted experiments on approximate dictionary matching with three large-scale datasets consisting of person names, biomedical names, and general English words. The algorithm scaled well on all three datasets. For example, it retrieved strings in 1.1 ms from the string collection of Google Web1T unigrams (with cosine similarity and a threshold of 0.7).
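The reduction to a τ-overlap join can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`features`, `build_index`, `overlap_join`, `search`), the use of padded character trigrams collapsed into a set, and the index layout are all assumptions made for illustration. For cosine similarity on feature sets X (query) and Y (dictionary string) with threshold α, the condition |X∩Y| / sqrt(|X||Y|) ≥ α gives a minimum overlap τ = ⌈α·sqrt(|X||Y|)⌉ and restricts |Y| to the range [α²|X|, |X|/α²]. The join itself follows the collect-then-prune idea described in the abstract: candidates are gathered from the shortest lists only, then checked against the longer lists by binary search and dropped as soon as they can no longer reach τ.

```python
import math
from bisect import bisect_left
from collections import defaultdict


def features(s, n=3):
    """Set of padded character n-grams (duplicates collapse; a simplification)."""
    padded = "$" * (n - 1) + s + "$" * (n - 1)
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def build_index(strings, n=3):
    """index[size][gram] -> ascending list of ids of strings with `size` features."""
    index = defaultdict(lambda: defaultdict(list))
    for sid, s in enumerate(strings):
        feats = features(s, n)
        for gram in feats:
            index[len(feats)][gram].append(sid)  # ids are appended in ascending order
    return index


def overlap_join(lists, tau):
    """Return the ids that occur in at least `tau` of the sorted id lists."""
    lists = sorted(lists, key=len)
    total = len(lists)
    if tau < 1 or tau > total:
        return []
    # Any id present in >= tau lists must appear in one of the
    # (total - tau + 1) shortest lists, so candidates come from those only.
    split = total - tau + 1
    counts = defaultdict(int)
    for lst in lists[:split]:
        for sid in lst:
            counts[sid] += 1
    results = [sid for sid, c in counts.items() if c >= tau]
    pending = {sid: c for sid, c in counts.items() if c < tau}
    # Check the surviving candidates against the longer lists by binary search.
    for k in range(split, total):
        lst = lists[k]
        remaining = total - k - 1
        survivors = {}
        for sid, c in pending.items():
            pos = bisect_left(lst, sid)
            if pos < len(lst) and lst[pos] == sid:
                c += 1
            if c >= tau:
                results.append(sid)
            elif c + remaining >= tau:  # otherwise tau is out of reach: prune
                survivors[sid] = c
        pending = survivors
    return sorted(results)


def search(query, strings, index, alpha=0.7, n=3):
    """Strings whose cosine similarity to `query` is at least `alpha`."""
    x = features(query, n)
    eps = 1e-9  # guards ceil/floor against floating-point noise
    lo = max(1, math.ceil(alpha * alpha * len(x) - eps))
    hi = math.floor(len(x) / (alpha * alpha) + eps)
    found = []
    for size in range(lo, hi + 1):
        by_gram = index.get(size)
        if not by_gram:
            continue
        tau = math.ceil(alpha * math.sqrt(len(x) * size) - eps)
        lists = [by_gram.get(gram, []) for gram in x]  # one list per query feature
        found.extend(strings[sid] for sid in overlap_join(lists, tau))
    return found
```

For instance, with a hypothetical dictionary `["stanford", "stanfords", "oxford", "cambridge"]`, calling `search("stanford", strings, build_index(strings))` returns `["stanford", "stanfords"]`: the second string shares 8 of its 11 trigrams with the query, which meets τ = 8 for that size. Partitioning the index by feature-set size keeps τ constant within each join, which is what allows the early pruning.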