Efficient interactive fuzzy keyword search
Proceedings of the 18th international conference on World wide web
Space-economical partial gram indices for exact substring matching
Proceedings of the 18th ACM conference on Information and knowledge management
Matching person names through name transformation
Proceedings of the 18th ACM conference on Information and knowledge management
Efficient approximate search on string collections
Proceedings of the VLDB Endowment
Bed-tree: an all-purpose index structure for string similarity search based on edit distance
Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Fuzzy keyword search over encrypted data in cloud computing
INFOCOM'10 Proceedings of the 29th conference on Information communications
Simple and efficient algorithm for approximate dictionary matching
COLING '10 Proceedings of the 23rd International Conference on Computational Linguistics
Approximate entity extraction in temporal databases
World Wide Web
Foundations and Trends in Databases
ASTERIX: towards a scalable, semistructured data platform for evolving-world models
Distributed and Parallel Databases
Efficient similarity joins for near-duplicate detection
ACM Transactions on Database Systems (TODS)
A fast and accurate method for approximate string search
HLT '11 Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1
Proceedings of the Joint EDBT/ICDT 2013 Workshops
FPI: a novel indexing method using frequent patterns for approximate string searches
Proceedings of the Joint EDBT/ICDT 2013 Workshops
A fast generative spell corrector based on edit distance
ECIR'13 Proceedings of the 35th European conference on Advances in Information Retrieval
Efficient top-k algorithms for approximate substring matching
Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
A partition-based method for string similarity joins with edit-distance constraints
ACM Transactions on Database Systems (TODS)
Clustering with Proximity Graphs: Exact and Efficient Algorithms
International Journal of Knowledge-Based Organizations
Hi-index | 0.00 |
Answering approximate queries on string collections is important in applications such as data cleaning, query relaxation, and spell checking, where inconsistencies and errors exist in user queries as well as data. Many existing algorithms use gram-based inverted-list indexing structures to answer approximate string queries. These indexing structures are "notoriously" large compared to the size of their original string collection. In this paper, we study how to reduce the size of such an indexing structure to a given amount of space, while retaining efficient query processing. We first study how to adopt existing inverted-list compression techniques to solve our problem. Then, we propose two novel approaches for achieving the goal: one is based on discarding gram lists, and one is based on combining correlated lists. They are both orthogonal to existing compression techniques, exploit a unique property of our setting, and offer new opportunities for improving query performance. For each approach we analyze its effect on query performance and develop algorithms for wisely choosing lists to discard or combine. Our extensive experiments on real data sets show that our approaches provide applications the flexibility in deciding the tradeoff between query performance and indexing size, and can outperform existing compression techniques. An interesting and surprising finding is that while we can reduce the index size significantly (up to 60% reduction) with tolerable performance penalties, for 20-40% reductions we can even improve query performance compared to original indexes.