Fast nGram-based string search over data encoded using algebraic signatures

Authors:
Witold Litwin;Riad Mokadem;Philippe Rigaux;Thomas Schwarz
Affiliations:
Univ. Paris Dauphine;Univ. Paris Dauphine;Univ. Paris Dauphine & INRIA-Orsay, Equipe Gemo;Univ. Santa Clara
Venue:
VLDB '07 Proceedings of the 33rd international conference on Very large data bases
Year:
2007

Abstract

We propose a novel string search algorithm for data stored once and read many times. Our search method combines the sublinear traversal of the record (as in Boyer-Moore or Knuth-Morris-Pratt) with the agglomeration of parts of the record and search pattern into a single character, the algebraic signature, in the manner of Karp-Rabin. Our experiments show that our algorithm is up to seventy times faster for DNA data, up to eleven times faster for ASCII, and up to six times faster for XML documents compared with an implementation of Boyer-Moore. To obtain this speed-up, we store records in encoded form, where each original character is replaced with an algebraic signature. Our method applies to records stored in databases in general and to distributed implementations of a Database As Service (DAS) in particular. Clients send records for insertion and search patterns already in encoded form, and servers never operate on records in clear text. No one at a node can involuntarily discover the content of the stored data.
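The idea in the abstract can be illustrated with a minimal sketch; it is not the authors' implementation. It assumes a one-byte signature over GF(2^8) with the primitive polynomial 0x11D and alpha = 2, an encoding in which position i holds the signature of the n-gram starting at i, and a Horspool-style scan standing in for Boyer-Moore. The names `gf_mul`, `encode`, and `search_encoded` are hypothetical.

```python
from typing import List

# Assumed field parameters for this sketch: GF(2^8), polynomial 0x11D, alpha = 2.
POLY = 0x11D
ALPHA = 2


def gf_mul(a: int, b: int) -> int:
    """Multiply two elements of GF(2^8) modulo POLY (shift-and-add)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= POLY
        b >>= 1
    return result


def encode(data: bytes, n: int) -> bytes:
    """Replace each n-gram data[i:i+n] by a one-byte algebraic signature.

    Signature = XOR over j of data[i+j] * ALPHA**j, computed in GF(2^8).
    The layout (n-gram starting at i) is an assumption of this sketch.
    """
    powers = [1]
    for _ in range(n - 1):
        powers.append(gf_mul(powers[-1], ALPHA))
    out = bytearray()
    for i in range(len(data) - n + 1):
        sig = 0
        for j in range(n):
            sig ^= gf_mul(data[i + j], powers[j])
        out.append(sig)
    return bytes(out)


def search_encoded(enc_record: bytes, enc_pattern: bytes) -> List[int]:
    """Horspool scan over signatures; returns candidate start offsets."""
    k = len(enc_pattern)
    if k == 0 or k > len(enc_record):
        return []
    # For each signature in the pattern (except the last position), record
    # the distance from its last occurrence to the end of the pattern.
    shift = {s: k - 1 - i for i, s in enumerate(enc_pattern[:-1])}
    hits = []
    pos = 0
    while pos + k <= len(enc_record):
        if enc_record[pos:pos + k] == enc_pattern:
            hits.append(pos)
        # Skip ahead based on the signature aligned with the pattern's end.
        pos += shift.get(enc_record[pos + k - 1], k)
    return hits


if __name__ == "__main__":
    record = b"ACGTACGTTAGCACGT"
    pattern = b"ACGT"
    # Both sides are encoded by the client; the scan sees only signatures.
    print(search_encoded(encode(record, 2), encode(pattern, 2)))
```

Because distinct n-grams can share a one-byte signature, the returned offsets are candidates rather than guaranteed matches. The sketch also suggests why a small alphabet such as DNA might benefit most: n-gram signatures spread four letters over many more symbol values, so the skip table tends to allow longer jumps.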