Inference of concise regular expressions and DTDs

Authors:
Geert Jan Bex;Frank Neven;Thomas Schwentick;Stijn Vansummeren
Affiliations:
Hasselt University and Transnational University of Limburg, Belgium;Hasselt University and Transnational University of Limburg, Belgium;Dortmund University, Germany;Université Libre de Bruxelles, Belgium
Venue:
ACM Transactions on Database Systems (TODS)
Year:
2010

Abstract

We consider the problem of inferring a concise Document Type Definition (DTD) for a given set of XML documents, a problem that essentially reduces to learning concise regular expressions from positive example strings. We identify two classes of concise regular expressions, the single occurrence regular expressions (SOREs) and the chain regular expressions (CHAREs), that capture the vast majority of expressions used in practical DTDs. For the inference of SOREs we present several algorithms that first infer an automaton for a given set of example strings and then translate that automaton to a corresponding SORE, possibly repairing the automaton when no equivalent SORE can be found. In the process, we introduce a novel automaton-to-regular-expression rewrite technique which is of independent interest. When only a very small amount of XML data is available, however (for instance, when the data is generated by Web service requests or by answers to queries), these algorithms produce regular expressions that are too specific. Therefore, we introduce a novel learning algorithm, crx, that directly infers CHAREs (which form a subclass of SOREs) without going through an automaton representation. We show that crx performs very well within its target class on very small datasets.
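
The first step the abstract describes, inferring an automaton from positive example strings, can be illustrated with a minimal sketch. The abstract does not say which construction is used, so the code below assumes a simple 2-gram approach: every element name becomes one state, and an edge is added whenever two names appear consecutively in some example. The function names `infer_2gram_automaton` and `accepts`, the markers `SRC` and `SNK`, and the sample data are all hypothetical. The translation of the automaton to a SORE and the crx algorithm are not shown.

```python
from typing import Iterable, Sequence, Set, Tuple

# Hypothetical markers for the start and end of a string of child elements.
SRC, SNK = "<src>", "<snk>"

Edge = Tuple[str, str]


def infer_2gram_automaton(examples: Iterable[Sequence[str]]) -> Set[Edge]:
    """Collect every pair of consecutive symbols seen in the examples.

    Each symbol is a single state, so the automaton is fully described
    by its edge set (an assumption of this sketch).
    """
    edges: Set[Edge] = set()
    for word in examples:
        path = [SRC, *word, SNK]
        edges.update(zip(path, path[1:]))
    return edges


def accepts(edges: Set[Edge], word: Sequence[str]) -> bool:
    """A word is accepted if every consecutive pair is a known edge."""
    path = [SRC, *word, SNK]
    return all(pair in edges for pair in zip(path, path[1:]))


# Made-up child-element sequences of some parent element.
examples = [
    ["title", "author", "author", "year"],
    ["title", "author"],
]
automaton = infer_2gram_automaton(examples)
```

With these two examples the automaton also accepts the unseen sequence `title author author author year`, since every consecutive pair in it was observed. This shows how even a small sample is generalized beyond the strings it contains.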