RE-Tree: an efficient index structure for regular expressions

Authors:
Chee-Yong Chan; Minos Garofalakis; Rajeev Rastogi
Affiliations:
Bell Labs, Lucent Technologies; Bell Labs, Lucent Technologies; Bell Labs, Lucent Technologies
Venue:
VLDB '02 Proceedings of the 28th international conference on Very Large Data Bases
Year:
2002

Abstract

Due to their expressive power, Regular Expressions (REs) are quickly becoming an integral part of language specifications for several important application scenarios. Many of these applications have to manage huge databases of RE specifications and need to provide an effective matching mechanism that, given an input string, quickly identifies the REs in the database that match it. In this paper, we propose the RE-tree, a novel index structure for large databases of RE specifications. Given an input query string, the RE-tree speeds up the retrieval of matching REs by focusing the search and comparing the input string with only a small fraction of the REs in the database. Even though the RE-tree is similar in spirit to other tree-based structures that have been proposed for indexing multi-dimensional data, RE indexing is significantly more challenging since REs typically represent infinite sets of strings with no well-defined notion of spatial locality. To address these new challenges, our RE-tree index structure relies on novel measures for comparing the relative sizes of infinite regular languages. We also propose innovative solutions for the various RE-tree operations, including the effective splitting of RE-tree nodes and computing a "tight" bounding RE for a collection of REs. Finally, we demonstrate how sampling-based approximation algorithms can be used to significantly speed up RE-tree operations. Our experimental results with synthetic data sets indicate that the RE-tree is very effective in pruning the search space and easily outperforms naive sequential search approaches.
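
The pruning idea in the abstract can be illustrated with a toy sketch. This is not the paper's algorithm: the names `Node`, `leaf`, `group`, and `search` are hypothetical, the tree is built by hand rather than by insertion and node splitting, and the bounding RE of a node is assumed to be the plain union of its children, which is a valid bound but not a "tight" one in the paper's sense. The sketch only shows why a string that fails a node's bounding RE cannot match any RE stored beneath that node.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    bound: str                                  # regex covering every RE below this node
    children: List["Node"] = field(default_factory=list)
    is_leaf: bool = False                       # a leaf holds one RE from the database


def leaf(pattern: str) -> Node:
    return Node(bound=pattern, is_leaf=True)


def group(children: List[Node]) -> Node:
    # Assumption: the bounding RE is simply the union of the child REs.
    bound = "|".join(f"(?:{c.bound})" for c in children)
    return Node(bound=bound, children=children)


def search(node: Node, text: str, visited: List[str]) -> List[str]:
    """Return the stored REs that match `text` in full, recording visited nodes."""
    visited.append(node.bound)
    if re.fullmatch(node.bound, text) is None:
        return []                               # prune: nothing below can match
    if node.is_leaf:
        return [node.bound]
    hits: List[str] = []
    for child in node.children:
        hits.extend(search(child, text, visited))
    return hits


# Hand-built two-level tree over four example REs.
left = group([leaf("a(b|c)*"), leaf("ab+")])
right = group([leaf("x[0-9]+"), leaf("xy*z")])
root = group([left, right])

visited: List[str] = []
matches = search(root, "abb", visited)
print(matches)        # REs that match "abb"
print(len(visited))   # nodes examined, out of 7 in the tree
```

For the query `"abb"`, the right-hand subtree is rejected at its bounding RE, so its two leaves are never compared with the input. With union bounds the saving is modest, since testing the bound costs about as much as testing the children; tighter and more compact bounding REs are what would make the pruning pay off at scale.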