Space-Constrained Gram-Based Indexing for Efficient Approximate String Search

Authors:
Alexander Behm;Shengyue Ji;Chen Li;Jiaheng Lu
Affiliations:
-;-;-;-
Venue:
ICDE '09 Proceedings of the 2009 IEEE International Conference on Data Engineering
Year:
2009

Citing 0
Cited 18

Efficient interactive fuzzy keyword search

Proceedings of the 18th international conference on World wide web
Space-economical partial gram indices for exact substring matching

Proceedings of the 18th ACM conference on Information and knowledge management
Matching person names through name transformation

Proceedings of the 18th ACM conference on Information and knowledge management
Efficient approximate search on string collections

Proceedings of the VLDB Endowment
Bed-tree: an all-purpose index structure for string similarity search based on edit distance

Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Fuzzy keyword search over encrypted data in cloud computing

INFOCOM'10 Proceedings of the 29th conference on Information communications
Simple and efficient algorithm for approximate dictionary matching

COLING '10 Proceedings of the 23rd International Conference on Computational Linguistics
Approximate entity extraction in temporal databases

World Wide Web
Approximate String Processing

Foundations and Trends in Databases
ASTERIX: towards a scalable, semistructured data platform for evolving-world models

Distributed and Parallel Databases
Efficient similarity joins for near-duplicate detection

ACM Transactions on Database Systems (TODS)
A fast and accurate method for approximate string search

HLT '11 Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1
Efficient parallel partition-based algorithms for similarity search and join with edit distance constraints

Proceedings of the Joint EDBT/ICDT 2013 Workshops
FPI: a novel indexing method using frequent patterns for approximate string searches

Proceedings of the Joint EDBT/ICDT 2013 Workshops
A fast generative spell corrector based on edit distance

ECIR'13 Proceedings of the 35th European conference on Advances in Information Retrieval
Efficient top-k algorithms for approximate substring matching

Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
A partition-based method for string similarity joins with edit-distance constraints

ACM Transactions on Database Systems (TODS)
Clustering with Proximity Graphs: Exact and Efficient Algorithms

International Journal of Knowledge-Based Organizations

Quantified Score

Hi-index	0.00

Visualization

Abstract

Answering approximate queries on string collections is important in applications such as data cleaning, query relaxation, and spell checking, where inconsistencies and errors exist in user queries as well as data. Many existing algorithms use gram-based inverted-list indexing structures to answer approximate string queries. These indexing structures are "notoriously" large compared to the size of their original string collection. In this paper, we study how to reduce the size of such an indexing structure to a given amount of space, while retaining efﬁcient query processing. We ﬁrst study how to adopt existing inverted-list compression techniques to solve our problem. Then, we propose two novel approaches for achieving the goal: one is based on discarding gram lists, and one is based on combining correlated lists. They are both orthogonal to existing compression techniques, exploit a unique property of our setting, and offer new opportunities for improving query performance. For each approach we analyze its effect on query performance and develop algorithms for wisely choosing lists to discard or combine. Our extensive experiments on real data sets show that our approaches provide applications the ﬂexibility in deciding the tradeoff between query performance and indexing size, and can outperform existing compression techniques. An interesting and surprising ﬁnding is that while we can reduce the index size signiﬁcantly (up to 60% reduction) with tolerable performance penalties, for 20-40% reductions we can even improve query performance compared to original indexes.