Answering approximate string queries on large data sets using external memory

Authors:
Alexander Behm;Chen Li;Michael J. Carey
Affiliations:
Deptarment of Computer Science, University of California, Irvine, USA;Deptarment of Computer Science, University of California, Irvine, USA;Deptarment of Computer Science, University of California, Irvine, USA
Venue:
ICDE '11 Proceedings of the 2011 IEEE 27th International Conference on Data Engineering
Year:
2011

Citing 0
Cited 5

ASTERIX: scalable warehouse-style web data integration

Proceedings of the Ninth International Workshop on Information Integration on the Web
Efficient range queries over uncertain strings

SSDBM'12 Proceedings of the 24th international conference on Scientific and Statistical Database Management
Efficient parallel partition-based algorithms for similarity search and join with edit distance constraints

Proceedings of the Joint EDBT/ICDT 2013 Workshops
A partition-based method for string similarity joins with edit-distance constraints

ACM Transactions on Database Systems (TODS)
Asymmetric signature schemes for efficient exact edit similarity query processing

ACM Transactions on Database Systems (TODS)

Quantified Score

Hi-index	0.00

Visualization

Abstract

An approximate string query is to find from a collection of strings those that are similar to a given query string. Answering such queries is important in many applications such as data cleaning and record linkage, where errors could occur in queries as well as the data. Many existing algorithms have focused on in-memory indexes. In this paper we investigate how to efficiently answer such queries in a disk-based setting, by systematically studying the effects of storing data and indexes on disk. We devise a novel physical layout for an inverted index to answer queries and we study how to construct it with limited buffer space. To answer queries, we develop a cost-based, adaptive algorithm that balances the I/O costs of retrieving candidate matches and accessing inverted lists. Experiments on large, real datasets verify that simply adapting existing algorithms to a disk-based setting does not work well and that our new techniques answer queries efficiently. Further, our solutions significantly outperform a recent tree-based index, BED-tree.