An algorithm for approximate membership checking with application to password security
Information Processing Letters
Dictionary look-up with one error
Journal of Algorithms
Approximate nearest neighbors: towards removing the curse of dimensionality
STOC '98 Proceedings of the thirtieth annual ACM symposium on Theory of computing
Syntactic clustering of the Web
Selected papers from the sixth international conference on World Wide Web
Improved bounds for dictionary look-up with one error
Information Processing Letters
Approximate Dictionary Queries
CPM '96 Proceedings of the 7th Annual Symposium on Combinatorial Pattern Matching
Locality-sensitive hashing scheme based on p-stable distributions
SCG '04 Proceedings of the twentieth annual symposium on Computational geometry
Dictionary matching and indexing with errors and don't cares
STOC '04 Proceedings of the thirty-sixth annual ACM symposium on Theory of computing
Efficient exact set-similarity joins
VLDB '06 Proceedings of the 32nd international conference on Very large data bases
Detecting near-duplicates for web crawling
Proceedings of the 16th international conference on World Wide Web
SpotSigs: robust and efficient near duplicate detection in large web collections
Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
Efficient Merging and Filtering Algorithms for Approximate String Searches
ICDE '08 Proceedings of the 2008 IEEE 24th International Conference on Data Engineering
Large scale Hamming distance query processing
ICDE '11 Proceedings of the 2011 IEEE 27th International Conference on Data Engineering
Locality-sensitive hashing scheme based on dynamic collision counting
SIGMOD '12 Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
Fast search in Hamming space with multi-index hashing
CVPR '12 Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
Hi-index | 0.00 |
Hamming distance measures the number of dimensions where two vectors have different values. In applications such as pattern recognition, information retrieval, and databases, we often need to efficiently process Hamming distance query, which retrieves vectors in a database that have no more than k Hamming distance from a given query vector. Existing work on efficient Hamming distance query processing has some of the following limitations, such as only applicable to tiny error threshold values, unable to deal with vectors where the value domain is large, or unable to attain robust performance in the presence of data skew. In this paper, we propose HmSearch, an efficient query processing method for Hamming distance queries that addresses the above-mentioned limitations. Our method is based on improved enumeration-based signatures, enhanced filtering, and the hierarchical binary filtering-and-verification. We also design an effective dimension rearrangement method to deal with data skew. Extensive experimental results demonstrate that our methods outperform state-of-the-art methods by up to two orders of magnitude.