Efficient algorithms for approximate member extraction using signature-based inverted lists

Authors:
Jiaheng Lu;Jialong Han;Xiaofeng Meng
Affiliations:
Renmin University of China, Beijing, China;Renmin University of China, Beijing, China;Renmin University of China, Beijing, China
Venue:
Proceedings of the 18th ACM conference on Information and knowledge management
Year:
2009

Citing 13
Cited 4

Managing gigabytes (2nd ed.): compressing and indexing documents and images

Managing gigabytes (2nd ed.): compressing and indexing documents and images
A guided tour to approximate string matching

ACM Computing Surveys (CSUR)
Computers and Intractability; A Guide to the Theory of NP-Completeness

Computers and Intractability; A Guide to the Theory of NP-Completeness
Approximate String Joins in a Database (Almost) for Free

Proceedings of the 27th International Conference on Very Large Data Bases
On Using q-Gram Locations in Approximate String Matching

ESA '95 Proceedings of the Third Annual European Symposium on Algorithms
Efficient set joins on similarity predicates

SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
A Primitive Operator for Similarity Joins in Data Cleaning

ICDE '06 Proceedings of the 22nd International Conference on Data Engineering
Efficient Batch Top-k Search for Dictionary-based Entity Recognition

ICDE '06 Proceedings of the 22nd International Conference on Data Engineering
Efficient exact set-similarity joins

VLDB '06 Proceedings of the 32nd international conference on Very large data bases
VGRAM: improving performance of approximate queries on string collections using variable-length grams

VLDB '07 Proceedings of the 33rd international conference on Very large data bases
An efficient filter for approximate membership checking

Proceedings of the 2008 ACM SIGMOD international conference on Management of data
Efficient Merging and Filtering Algorithms for Approximate String Searches

ICDE '08 Proceedings of the 2008 IEEE 24th International Conference on Data Engineering
Efficient approximate entity extraction with edit distance constraints

Proceedings of the 2009 ACM SIGMOD International Conference on Management of data

Approximate membership localization (AML) for web-based join

CIKM '10 Proceedings of the 19th ACM international conference on Information and knowledge management
Faerie: efficient filtering algorithms for approximate dictionary-based entity extraction

Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
PartSS: an efficient partition-based filtering for edit distance constraints

ADC '11 Proceedings of the Twenty-Second Australasian Database Conference - Volume 115
Extending string similarity join to tolerant fuzzy token matching

ACM Transactions on Database Systems (TODS)

Quantified Score

Hi-index	0.00

Visualization

Abstract

We study the problem of approximate membership extraction (AME), i.e., how to efficiently extract substrings in a text document that approximately match some strings in a given dictionary. This problem is important in a variety of applications such as named entity recognition and data cleaning. We solve this problem in two steps. In the first step, for each substring in the text, we filter away the strings in the dictionary that are very different from the substring. In the second step, each candidate string is verified to decide whether the substring should be extracted. We develop an incremental algorithm using signature-based inverted lists to minimize the duplicate list-scan operations of overlapping windows in the text. Our experimental study of the proposed algorithms on real and synthetic datasets showed that our solutions significantly outperform existing methods in the literature.