A partition-based efficient algorithm for large scale multiple-strings matching

  • Authors:
  • Ping Liu;Yan-bing Liu;Jian-long Tan

  • Affiliations:
  • Software Division, Institute of Computing Technology, Chinese Academy of Sciences, Beijing;Software Division, Institute of Computing Technology, Chinese Academy of Sciences, Beijing;Software Division, Institute of Computing Technology, Chinese Academy of Sciences, Beijing

  • Venue:
  • SPIRE'05 Proceedings of the 12th international conference on String Processing and Information Retrieval
  • Year:
  • 2005

Abstract

Filtering plays an important role in Internet security and information retrieval, and it usually relies on a multiple-strings matching algorithm as its key component. All the classical matching algorithms, however, perform badly once the number of keywords exceeds a critical point, which makes large-scale multiple-strings matching a great challenge. Based on the observation that the speed of the classical algorithms depends mainly on the length of the shortest keyword, a partition strategy is proposed that decomposes the keyword set into a series of subsets, on each of which a classical algorithm is run. For the optimal partition, it is proved that keywords of the same length lie in the same subset, and that the keyword lengths of different subsets do not interleave. In this paper, we propose a shortest-path model for the problem of finding the optimal partition. Experiments on both random and real data demonstrate that our algorithms generally achieve a speed-up of about 100-300% over the classical ones.
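
The two structural properties stated in the abstract (keywords of equal length stay together, and subsets cover non-interleaving length ranges) suggest that every candidate partition is a way of cutting the sorted list of distinct keyword lengths into consecutive runs. Under that reading, finding the best partition can be sketched as a shortest-path search on a small directed acyclic graph whose nodes are cut positions and whose edges are candidate subsets. The sketch below is only an illustration: the abstract does not give the paper's cost model, so `toy_cost` and the function name `optimal_partition` are hypothetical, and any real use would need a cost function measured for the matching algorithm at hand.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence, Tuple


def toy_cost(min_len: int, count: int) -> float:
    """Hypothetical cost of scanning the text once with one subset.

    Assumption (not from the paper): one fixed unit per pass, plus a term
    that grows with the number of keywords and shrinks as the shortest
    keyword in the subset gets longer.
    """
    return 1.0 + count / min_len


def optimal_partition(
    keywords: Sequence[str],
    cost: Callable[[int, int], float] = toy_cost,
) -> Tuple[float, List[List[str]]]:
    """Split keywords into length-contiguous subsets of minimal total cost."""
    # Group keywords by length, so equal-length keywords share a subset.
    by_len: Dict[int, List[str]] = defaultdict(list)
    for kw in keywords:
        by_len[len(kw)].append(kw)
    lengths = sorted(by_len)
    n = len(lengths)
    if n == 0:
        return 0.0, []

    # prefix[i] = number of keywords whose length is among lengths[:i].
    prefix = [0]
    for length in lengths:
        prefix.append(prefix[-1] + len(by_len[length]))

    # Node i is a cut placed before lengths[i]. An edge i -> j (i < j)
    # stands for one subset holding the lengths lengths[i:j]. Nodes are
    # already in topological order, so one forward sweep suffices.
    best = [float("inf")] * (n + 1)
    back = [0] * (n + 1)
    best[0] = 0.0
    for j in range(1, n + 1):
        for i in range(j):
            candidate = best[i] + cost(lengths[i], prefix[j] - prefix[i])
            if candidate < best[j]:
                best[j] = candidate
                back[j] = i

    # Walk the back-pointers from the last node to recover the subsets.
    groups: List[List[str]] = []
    j = n
    while j > 0:
        i = back[j]
        groups.append([kw for length in lengths[i:j] for kw in by_len[length]])
        j = i
    groups.reverse()
    return best[n], groups
```

Calling `optimal_partition(keywords)` returns the total estimated cost together with the subsets, ordered from the shortest keywords to the longest. The sweep examines every pair of cut positions, so its running time is quadratic in the number of distinct keyword lengths, not in the number of keywords.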