String matching on the internet

  • Authors:
  • Hervé Brönnimann;Nasir Memon;Kulesh Shanmugasundaram

  • Affiliations:
  • Polytechnic University, NY;Polytechnic University, NY;Polytechnic University, NY

  • Venue:
  • CAAN'04 Proceedings of the First international conference on Combinatorial and Algorithmic Aspects of Networking
  • Year:
  • 2004

Abstract

We consider a variant of the “string searching in database” problem where the string database comes on a data stream, and processing the data is at a premium but querying is not a runtime bottleneck. Specifically, the strings to be searched into (let's call them the documents) have to be processed online very efficiently, meaning the documents have to be added to some string searching data structure one by one in time proportional to their length. Of course, we desire this data structure to be small, i.e. at most linear space, and hopefully exhibit a tradeoff between storage/processing cost and accuracy. Upon some query string, the data structure must return whether that string is contained in a document (the presence query), and must also be able to return a list of the documents which contain the query (the attribution query). We may require that the query be large enough and that only portions of it may match (pattern matching). In practice, it is acceptable that the data structure return a superset of the answer, as long as no document from the answer is missing and there are only few false positives; either the false positives can be filtered (by actual verification if the document texts are available in a repository), or a small number of false positives are acceptable for the application (e.g. network forensics, see below).
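To make the presence and attribution queries concrete, here is a minimal Python sketch of one way such a structure could behave. It is an illustration under stated assumptions, not the construction from the paper: it stores fixed-length substrings (q-grams) of each document in two Bloom filters, one keyed by the q-gram alone and one keyed by the document identifier plus the q-gram. The names `BloomFilter`, `StreamIndex`, `maybe_present`, and `candidates`, as well as the parameters `q`, `num_bits`, and `num_hashes`, are all hypothetical. For a fixed `q` and bounded identifier length, each document is added in time proportional to its length, and a query of at least `q` bytes never misses a document that contains it, although it may report false positives.

```python
import hashlib


class BloomFilter:
    """Plain Bloom filter: membership tests may give false positives, never false negatives."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item: bytes):
        # Double hashing: derive all bit positions from a single SHA-256 digest.
        digest = hashlib.sha256(item).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: bytes):
        for pos in self._positions(item):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, item: bytes):
        return all(self.bits[pos >> 3] & (1 << (pos & 7))
                   for pos in self._positions(item))


class StreamIndex:
    """Hypothetical q-gram index over a stream of documents (illustrative only)."""

    def __init__(self, q=4, num_bits=1 << 16, num_hashes=4):
        self.q = q
        self.presence = BloomFilter(num_bits, num_hashes)     # keyed by q-gram
        self.attribution = BloomFilter(num_bits, num_hashes)  # keyed by doc id + q-gram
        self.doc_ids = []

    def _grams(self, text: str):
        data = text.encode("utf-8")
        for i in range(len(data) - self.q + 1):
            yield data[i:i + self.q]

    def _query_grams(self, query: str):
        grams = list(self._grams(query))
        if not grams:
            raise ValueError(f"query must be at least {self.q} bytes long")
        return grams

    def add_document(self, doc_id: str, text: str):
        # One pass over the document: a constant number of filter updates per position.
        self.doc_ids.append(doc_id)
        tag = doc_id.encode("utf-8") + b"\x00"
        for gram in self._grams(text):
            self.presence.add(gram)
            self.attribution.add(tag + gram)

    def maybe_present(self, query: str) -> bool:
        # Presence query: every q-gram of the query must have been seen.
        return all(g in self.presence for g in self._query_grams(query))

    def candidates(self, query: str):
        # Attribution query: a superset of the documents containing the query.
        grams = self._query_grams(query)
        result = []
        for doc_id in self.doc_ids:
            tag = doc_id.encode("utf-8") + b"\x00"
            if all(tag + g in self.attribution for g in grams):
                result.append(doc_id)
        return result


index = StreamIndex(q=4)
index.add_document("doc1", "packets carry an interesting payload")
index.add_document("doc2", "nothing remarkable in this stream")
print(index.maybe_present("interesting"))
print(index.candidates("interesting"))
```

In this sketch the attribution query tests every known document identifier, which matches the setting above where querying is not the bottleneck. Enlarging `num_bits` lowers the false positive rate at the cost of storage, which is one simple form of the storage versus accuracy tradeoff the abstract mentions. Any candidate can still be verified against the original text when a repository of documents is available.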