Improving regular-expression matching on strings using negative factors

Authors:
Xiaochun Yang;Bin Wang;Tao Qiu;Yaoshu Wang;Chen Li
Affiliations:
Northeastern University, Shenyang, China;Northeastern University, Shenyang, China;Northeastern University, Shenyang, China;Northeastern University, Shenyang, China;UC Irvine, Irvine, USA
Venue:
Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
Year:
2013

Abstract

The problem of finding matches of a regular expression (RE) on a string arises in many applications such as text editing, biosequence search, and shell commands. Existing techniques first identify candidates using substrings in the RE, then verify each of them using an automaton. These techniques become inefficient when there are many candidate occurrences that need to be verified. In this paper we propose a novel technique that prunes false-positive candidates by utilizing negative factors, which are substrings that cannot appear in an answer. A main advantage of the technique is that it can be integrated with many existing algorithms to improve their efficiency significantly. We give a full specification of this technique. We develop an efficient algorithm that utilizes negative factors to prune candidates, then improve it by using bit operations to process negative factors in parallel. We show that negative factors, when used together with necessary factors (substrings that must appear in each answer), can achieve much better pruning power. We analyze the large number of negative factors, and develop an algorithm for finding a small number of high-quality negative factors. We conducted a thorough experimental study of this technique on real data sets, including DNA sequences, proteins, and text documents, and show that applying the technique to existing algorithms improves their performance significantly. For instance, it improved the search speed of the popular GNU Grep tool by 11 to 74 times for text documents.
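
The filter-then-verify idea described in the abstract can be illustrated with a small toy sketch. It is not the paper's algorithm: the function names, the sample RE `ab[cd]+e`, the hand-picked negative factors, and the pruning rule are all assumptions made for illustration. The rule used here is a simplification: every answer to this RE has at least `min_len` characters, so a candidate can be discarded if a negative factor occurs inside the first `min_len` characters after its start position. The necessary factor `ab` plays the role of the candidate filter, and Python's `re` module stands in for the automaton-based verification.

```python
import re


def find_candidates(text, necessary_factor):
    """Return every start position of the necessary factor (overlaps allowed)."""
    starts = []
    i = text.find(necessary_factor)
    while i != -1:
        starts.append(i)
        i = text.find(necessary_factor, i + 1)
    return starts


def prune_with_negative_factors(text, candidates, negative_factors, min_len):
    """Drop candidates whose first `min_len` characters contain a negative factor.

    Assumption for this sketch: any answer starting at `start` covers at least
    text[start:start + min_len], so a negative factor inside that window rules
    the candidate out without running the verifier.
    """
    survivors = []
    for start in candidates:
        window = text[start:start + min_len]
        if len(window) < min_len:
            continue  # not enough text left to hold even the shortest answer
        if any(nf in window for nf in negative_factors):
            continue  # the window contains a substring no answer can contain
        survivors.append(start)
    return survivors


def verify(text, candidates, pattern):
    """Keep only the start positions where the RE really matches."""
    compiled = re.compile(pattern)
    return [s for s in candidates if compiled.match(text, s)]


# Hypothetical example data, chosen by hand for this sketch.
PATTERN = r"ab[cd]+e"      # answers look like "ab", then c/d one or more times, then "e"
NECESSARY_FACTOR = "ab"    # every answer starts with this substring
MIN_LEN = 4                # the shortest answers are "abce" and "abde"
# None of these two-character strings can occur inside any answer.
NEGATIVE_FACTORS = ["ba", "bb", "be", "ca", "cb", "da", "db"]

text = "abbe abcde abex abdde abca"
candidates = find_candidates(text, NECESSARY_FACTOR)
survivors = prune_with_negative_factors(text, candidates, NEGATIVE_FACTORS, MIN_LEN)
answers = verify(text, survivors, PATTERN)
```

In this example the verifier runs on two candidates instead of five, and the set of answers is unchanged, because pruning only removes start positions that could not have matched.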