We consider the problem of identifying substrings of input text that approximately match some member of a potentially large dictionary. This problem arises in several important applications, such as extracting named entities from text documents and identifying biological concepts in biomedical literature. In this paper, we develop a filter-verification framework and propose a novel in-memory filter structure: we first quickly filter out substrings that cannot match any dictionary member, and then verify the remaining substrings against the dictionary. Our method produces no false negatives. We demonstrate the efficiency and effectiveness of our filter on real datasets and show that it significantly outperforms the previous best-known methods in both filtering power and computation time.
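To make the filter-verification idea concrete, the following is a minimal Python sketch. It is not the filter structure proposed in the paper. It assumes token-level Jaccard similarity as the matching criterion and uses a plain token-to-entry index as a stand-in filter. The function names, the `threshold` and `max_len` parameters, and the sample data are all hypothetical. The filter is safe in the sense the abstract describes: with a positive threshold, a substring that shares no token with a dictionary entry has Jaccard similarity zero to it, so skipping that entry cannot cause a false negative.

```python
from collections import defaultdict


def jaccard(a, b):
    """Jaccard similarity between two token collections."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a or b else 1.0


def build_token_index(dictionary):
    """Map each token to the ids of the dictionary entries containing it."""
    index = defaultdict(set)
    for entry_id, entry in enumerate(dictionary):
        for token in entry.lower().split():
            index[token].add(entry_id)
    return index


def extract_matches(text, dictionary, threshold=0.6, max_len=4):
    """Return (substring, entry, score) triples with Jaccard >= threshold.

    Assumes threshold > 0 (otherwise the filter below would not be safe).
    """
    index = build_token_index(dictionary)
    entry_tokens = [set(e.lower().split()) for e in dictionary]
    tokens = text.lower().split()
    results = []
    for start in range(len(tokens)):
        last_end = min(start + max_len, len(tokens))
        for end in range(start + 1, last_end + 1):
            window = tokens[start:end]
            # Filter step: only entries sharing at least one token with the
            # window can reach a positive Jaccard score, so all other
            # entries are skipped without risking a false negative.
            candidates = set()
            for token in window:
                candidates |= index.get(token, set())
            # Verification step: exact similarity for the survivors only.
            for entry_id in sorted(candidates):
                score = jaccard(window, entry_tokens[entry_id])
                if score >= threshold:
                    results.append(
                        (" ".join(window), dictionary[entry_id], score)
                    )
    return results


if __name__ == "__main__":
    sample_dictionary = ["canon eos 5d", "nikon d300"]  # hypothetical data
    for match in extract_matches(
        "i bought a canon eos 5d camera", sample_dictionary
    ):
        print(match)
```

This token-overlap filter is deliberately weak: it prunes only entries with nothing in common with the substring. A stronger filter would discard more candidates before verification, which is the kind of filtering power the abstract refers to.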