An efficient filter for approximate membership checking

Authors:
Kaushik Chakrabarti;Surajit Chaudhuri;Venkatesh Ganti;Dong Xin
Affiliations:
Microsoft Research, Redmond, WA, USA;Microsoft Research, Redmond, WA, USA;Microsoft Research, Redmond, WA, USA;Microsoft Research, Redmond, WA, USA
Venue:
Proceedings of the 2008 ACM SIGMOD international conference on Management of data
Year:
2008

Abstract

We consider the problem of identifying substrings of input text that approximately match some member of a potentially large dictionary. This problem arises in several important applications, such as extracting named entities from text documents and identifying biological concepts in biomedical literature. In this paper, we develop a filter-verification framework and propose a novel in-memory filter structure: we first quickly filter out substrings that cannot match any dictionary member, and then verify the remaining substrings against the dictionary. The filter produces no false negatives, so every substring that approximately matches a dictionary member reaches the verification step. We demonstrate the efficiency and effectiveness of our filter on real datasets, and show that it significantly outperforms the previous best-known methods in both filtering power and computation time.
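The filter-then-verify idea can be illustrated with a minimal Python sketch. It is not the filter structure proposed in the paper, which the abstract does not describe. It assumes whitespace tokenization, Jaccard similarity over token sets, and a simple counting filter; the names `build_index`, `passes_filter`, and `extract`, and the parameters `threshold` and `max_len`, are hypothetical. The filter relies on a necessary condition: if a substring s reaches the Jaccard threshold against some entry, then the fraction of tokens of s that occur anywhere in the dictionary is at least that threshold, so rejecting substrings below it cannot cause a false negative.

```python
from collections import defaultdict


def jaccard(a, b):
    """Jaccard similarity of two token sets."""
    return len(a & b) / len(a | b) if (a or b) else 1.0


def build_index(dictionary):
    """Map each token to the ids of the dictionary entries that contain it."""
    entries = [frozenset(e.lower().split()) for e in dictionary]
    index = defaultdict(set)
    for i, tokens in enumerate(entries):
        for tok in tokens:
            index[tok].add(i)
    return entries, index


def passes_filter(sub, index, threshold):
    """Cheap necessary condition (assumed filter, not the paper's).

    For any entry e, jaccard(sub, e) <= known / len(sub), where `known`
    counts the tokens of sub that occur somewhere in the dictionary.
    """
    known = sum(1 for tok in sub if tok in index)
    return known / len(sub) >= threshold


def extract(text, dictionary, threshold=0.7, max_len=4):
    """Return (substring, entry) pairs whose Jaccard similarity >= threshold."""
    entries, index = build_index(dictionary)
    tokens = text.lower().split()
    matches = []
    for start in range(len(tokens)):
        for end in range(start + 1, min(start + max_len, len(tokens)) + 1):
            sub = frozenset(tokens[start:end])
            if not passes_filter(sub, index, threshold):
                continue  # filter step: sub cannot match any entry
            # Verification step: compare only against entries sharing a token.
            candidates = set().union(*(index[t] for t in sub if t in index))
            for i in sorted(candidates):
                if jaccard(sub, entries[i]) >= threshold:
                    matches.append((" ".join(tokens[start:end]), dictionary[i]))
    return matches


if __name__ == "__main__":
    entries = ["canon eos 350d digital camera", "nikon coolpix"]
    for pair in extract("the canon eos digital camera review", entries):
        print(pair)
```

In this sketch the filter only checks membership in an in-memory token index, while verification computes the similarity against each candidate entry. A stronger filter would prune more substrings before verification, and that filtering power is the quantity the abstract says the paper improves.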