Scalable regular expression matching on data streams

  • Authors:
  • Anirban Majumder;Rajeev Rastogi;Sriram Vanama

  • Affiliations:
  • Bell Labs Research India, India;Yahoo!, Bangalore, India;Indian Institute of Technology, Madras, Chennai, India

  • Venue:
  • Proceedings of the 2008 ACM SIGMOD international conference on Management of data
  • Year:
  • 2008

Quantified Score

Hi-index 0.00

Visualization

Abstract

Regular Expression (RE) matching has important applications in the areas of XML content distribution and network security. In this paper, we present the end-to-end design of a high performance RE matching system. Our system combines the processing efficiency of Deterministic Finite Automata (DFA) with the space efficiency of Non-deterministic Finite Automata (NFA) to scale to hundreds of REs. In experiments with real-life RE data on data streams, we found that a bulk of the DFA transitions are concentrated around a few DFA states. We exploit this fact to cache only the frequent core of each DFA in memory as opposed to the entire DFA (which may be exponential in size). Further, we cluster REs such that REs whose interactions cause an exponential increase in the number of states are assigned to separate groups -- this helps to improve cache hits by controlling the overall DFA size. To the best of our knowledge, ours is the first end-to-end system capable of matching REs at high speeds and in their full generality. Through a clever combination of RE grouping, and static and dynamic caching, it is able to perform RE matching at high speeds, even in the presence of limited memory. Through experiments with real-life data sets, we show that our RE matching system convincingly outperforms a state-of-the-art Network Intrusion Detection tool with support for efficient RE matching.