Inference of concise regular expressions and DTDs

  • Authors:
  • Geert Jan Bex;Frank Neven;Thomas Schwentick;Stijn Vansummeren

  • Affiliations:
  • Hasselt University and Transnational University of Limburg, Belgium;Hasselt University and Transnational University of Limburg, Belgium;Dortmund University, Germany;Université Libre de Bruxelles, Belgium

  • Venue:
  • ACM Transactions on Database Systems (TODS)
  • Year:
  • 2010

Abstract

We consider the problem of inferring a concise Document Type Definition (DTD) for a given set of XML documents, a problem that essentially reduces to learning concise regular expressions from positive example strings. We identify two classes of concise regular expressions, the single occurrence regular expressions (SOREs) and the chain regular expressions (CHAREs), that capture the vast majority of expressions used in practical DTDs. For the inference of SOREs we present several algorithms that first infer an automaton for a given set of example strings and then translate that automaton to a corresponding SORE, possibly repairing the automaton when no equivalent SORE can be found. In the process, we introduce a novel automaton-to-regular-expression rewrite technique that is of independent interest. When only a very small amount of XML data is available, however (for instance, when the data is generated by Web service requests or by answers to queries), these algorithms produce regular expressions that are too specific. Therefore, we introduce a novel learning algorithm, crx, that directly infers CHAREs (which form a subclass of SOREs) without going through an automaton representation. We show that crx performs very well within its target class on very small datasets.
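The first step of the automaton-based approach, inferring an automaton from example strings, can be illustrated with a minimal sketch. The abstract does not specify the construction, so the code below assumes a simple one: each element name becomes a single state, and an edge is added for every pair of names seen adjacent in some example. The names `Soa`, `infer_soa`, and `accepts` are hypothetical and chosen for illustration only; the translation of the automaton to a SORE and the repair step are not shown.

```python
from typing import Iterable, NamedTuple, Sequence


class Soa(NamedTuple):
    """Hypothetical container: an automaton with one state per element name."""
    initial: frozenset        # names that may start a string
    final: frozenset          # names that may end a string
    edges: frozenset          # (a, b) pairs: b may directly follow a
    accepts_empty: bool       # True if the empty string was among the examples


def infer_soa(samples: Iterable[Sequence[str]]) -> Soa:
    """Build the automaton from positive examples (assumed 2-gram construction)."""
    initial, final, edges = set(), set(), set()
    accepts_empty = False
    for sample in samples:
        if not sample:
            accepts_empty = True
            continue
        initial.add(sample[0])
        final.add(sample[-1])
        # Record every pair of adjacent element names.
        edges.update(zip(sample, sample[1:]))
    return Soa(frozenset(initial), frozenset(final), frozenset(edges), accepts_empty)


def accepts(soa: Soa, word: Sequence[str]) -> bool:
    """Check whether the automaton accepts a sequence of element names."""
    if not word:
        return soa.accepts_empty
    if word[0] not in soa.initial or word[-1] not in soa.final:
        return False
    return all(pair in soa.edges for pair in zip(word, word[1:]))


# Child-element sequences of a hypothetical <book> element.
examples = [
    ["title", "author", "author", "year"],
    ["title", "author"],
]
soa = infer_soa(examples)
print(accepts(soa, ["title", "author", "author", "author", "year"]))
print(accepts(soa, ["author", "title"]))
```

For these two examples the automaton generalizes beyond the input: it accepts any number of `author` elements after `title`, optionally followed by `year`. One SORE describing that language is `title author+ year?`, in which each element name occurs exactly once.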