Regular expression matching with multi-strings and intervals

Authors:
Philip Bille;Mikkel Thorup
Affiliations:
Danish Agency for Science, Technology, Innovation;Danish Agency for Science, Technology, Innovation
Venue:
SODA '10 Proceedings of the twenty-first annual ACM-SIAM symposium on Discrete Algorithms
Year:
2010

Abstract

Regular expression matching is a key task (and often a computational bottleneck) in a variety of software tools and applications, for instance the standard grep and sed utilities, scripting languages such as Perl, internet traffic analysis, XML querying, and protein searching. In the basic definition, a regular expression combines characters using the union, concatenation, and Kleene star operators, and its length m is proportional to the number of characters. However, often the initial operation is to concatenate characters into fairly long strings, e.g., when searching for certain combinations of words in a firewall. As a result, the number k of strings in the regular expression is significantly smaller than m. Our main result is a new algorithm that essentially replaces m with k in the complexity bounds for regular expression matching. More precisely, after an O(m log k) time and O(m) space preprocessing of the expression, we can match it in a string presented as a stream of characters in O((k log w)/w + log k) time per character, where w is the number of bits in a memory word. For large w, this corresponds to the previous best bound of O((m log w)/w + log m). Prior to this work no O(k) bound per character was known. We further extend our solution to efficiently handle character class interval operators C{x, y}. Here, C is a set of characters and C{x, y}, where x and y are integers such that 0 ≤ x ≤ y, represents a string of length between x and y from C. These character class intervals generalize variable length gaps, which are frequently used for pattern matching in computational biology applications.
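As a rough illustration of the quantities in the abstract, the following Python sketch counts m and k for a toy expression and shows what C{x, y} denotes. It is not the paper's algorithm. The helper names (`count_m_and_k`, `per_char_bound`, `interval_pattern`) are hypothetical, the toy syntax assumes only union `|`, Kleene star `*`, and parentheses (with `*` applied only to parenthesized groups), m is simplified to the exact number of characters, and the bound is evaluated with all constant factors dropped. The interval helper relies on Python's backtracking `re` module purely to show the semantics of C{x, y}.

```python
import math
import re
from typing import Tuple

# Assumed toy syntax: union '|', Kleene star '*', and grouping parentheses.
# Every other character is treated as a literal.
OPERATORS = set("|*()")


def count_m_and_k(pattern: str) -> Tuple[int, int]:
    """Return (m, k): the number of literal characters and the number of
    maximal runs of consecutive literals (the 'strings')."""
    m = 0
    k = 0
    in_string = False
    for ch in pattern:
        if ch in OPERATORS:
            in_string = False      # an operator ends the current string
        else:
            m += 1
            if not in_string:      # first literal of a new string
                k += 1
                in_string = True
    return m, k


def per_char_bound(n: int, w: int) -> float:
    """Evaluate n*log2(w)/w + log2(n) with constants ignored.
    Illustrative only: big-O bounds hide constant factors."""
    return n * math.log2(w) / w + math.log2(n)


def interval_pattern(chars: str, x: int, y: int) -> "re.Pattern[str]":
    """Compile C{x, y}: a string of length between x and y over the set C."""
    if not 0 <= x <= y:
        raise ValueError("require 0 <= x <= y")
    return re.compile("[" + re.escape(chars) + "]{%d,%d}" % (x, y))


if __name__ == "__main__":
    pattern = "(virus|trojan|worm)(download|install)"
    m, k = count_m_and_k(pattern)
    print("m =", m, "k =", k)
    print("m-based term:", per_char_bound(m, 64))
    print("k-based term:", per_char_bound(k, 64))

    gap = interval_pattern("ACGT", 2, 3)
    print(bool(gap.fullmatch("AC")), bool(gap.fullmatch("ACGTA")))
```

For the example expression the five words contribute 30 characters in 5 strings, so the k-based expression is noticeably smaller than the m-based one for a 64-bit word. This is only a numerical reading of the formulas, not a measurement of any implementation.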