Scalable regular expression matching on data streams

Authors:
Anirban Majumder;Rajeev Rastogi;Sriram Vanama
Affiliations:
Bell Labs Research India, India;Yahoo!, Bangalore, India;Indian Institute of Technology, Madras, Chennai, India
Venue:
Proceedings of the 2008 ACM SIGMOD international conference on Management of data
Year:
2008

Abstract

Regular Expression (RE) matching has important applications in XML content distribution and network security. In this paper, we present the end-to-end design of a high-performance RE matching system. Our system combines the processing efficiency of Deterministic Finite Automata (DFA) with the space efficiency of Non-deterministic Finite Automata (NFA) to scale to hundreds of REs. In experiments with real-life RE data on data streams, we found that the bulk of the DFA transitions are concentrated around a few DFA states. We exploit this fact to cache only the frequent core of each DFA in memory, as opposed to the entire DFA (which may be exponential in size). Further, we cluster REs such that REs whose interactions cause an exponential increase in the number of states are assigned to separate groups; this improves cache hits by controlling the overall DFA size. To the best of our knowledge, ours is the first end-to-end system capable of matching REs at high speeds and in their full generality. Through a clever combination of RE grouping and static and dynamic caching, it performs RE matching at high speeds even when memory is limited. Through experiments with real-life data sets, we show that our RE matching system convincingly outperforms a state-of-the-art Network Intrusion Detection tool with support for efficient RE matching.
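
The idea of keeping only part of a DFA in memory and falling back to the NFA for the rest can be illustrated with a minimal sketch. This is not the system described in the paper: it assumes an epsilon-free NFA given as a plain dictionary, builds DFA transitions lazily by subset construction, and uses simple LRU eviction as a stand-in for the paper's static and dynamic caching policies. The class name `CachedMatcher`, its parameters, and the example NFA are all hypothetical.

```python
from collections import OrderedDict


class CachedMatcher:
    """Lazy DFA over an NFA that keeps only a bounded number of DFA transitions.

    Assumption: LRU eviction is used here for illustration only; it is not
    the caching policy of the paper.
    """

    def __init__(self, nfa, start, accepting, capacity):
        self.nfa = nfa                      # (nfa_state, symbol) -> set of NFA states
        self.start = frozenset([start])     # a DFA state is a set of NFA states
        self.accepting = frozenset(accepting)
        self.capacity = capacity            # max number of cached DFA transitions
        self.cache = OrderedDict()          # (dfa_state, symbol) -> dfa_state
        self.hits = 0
        self.misses = 0

    def _step(self, state, symbol):
        key = (state, symbol)
        if key in self.cache:
            # Fast path: the transition is in the cached part of the DFA.
            self.hits += 1
            self.cache.move_to_end(key)
            return self.cache[key]
        # Slow path: recompute the transition from the NFA.
        self.misses += 1
        nxt = frozenset(t for s in state for t in self.nfa.get((s, symbol), ()))
        self.cache[key] = nxt
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict the least recently used entry
        return nxt

    def match_positions(self, stream):
        """Return the stream offsets at which a match ends."""
        state = self.start
        positions = []
        for i, symbol in enumerate(stream):
            state = self._step(state, symbol)
            if state & self.accepting:
                positions.append(i)
        return positions


# Hypothetical NFA that finds the substring "ab" anywhere in a stream
# over the alphabet {a, b}: state 0 loops, 0 -a-> 1, 1 -b-> 2 (accepting).
nfa = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}}

matcher = CachedMatcher(nfa, start=0, accepting={2}, capacity=4)
positions = matcher.match_positions("aabab")
print(positions, matcher.hits, matcher.misses)
```

The cache capacity affects only speed, not results: a smaller cache produces more NFA fallbacks but reports the same match positions. Keeping the combined DFA small, which the abstract attributes to RE grouping, is what would let a bounded cache cover most of the transitions actually taken.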