On the use of regular expressions for searching text

  • Authors:
  • Charles L. A. Clarke;Gordon V. Cormack

  • Affiliations:
  • Univ. of Waterloo, Waterloo, Ont., Canada;Univ. of Waterloo, Waterloo, Ont., Canada

  • Venue:
  • ACM Transactions on Programming Languages and Systems (TOPLAS)
  • Year:
  • 1997

Quantified Score

Hi-index 0.00

Visualization

Abstract

The use of regular expressions for text search is widely known and well understood. It is then surprising that the standard techniques and tools prove to be of limited use for searching structured text formatted with SGML or similar markup languages. Our experience with structured text search has caused us to reexamine the current practice. The generally accepted rule of “leftmost longest match” is an unfortunate choice and is at the root of the difficulties. We instead propose a rule which is semantically cleaner. This rule is generally applicable to a variety of text search applications, including source code analysis, and has interesting properties in its own right. We have written a publicly available search tool implementing the theory in the article, which has proved valuable in a variety of circumstances.