Learning Deterministic Regular Expressions for the Inference of Schemas from XML Data

Authors:
Geert Jan Bex;Wouter Gelade;Frank Neven;Stijn Vansummeren
Affiliations:
Hasselt University and Transnational University of Limburg;Hasselt University and Transnational University of Limburg;Hasselt University and Transnational University of Limburg;Université Libre de Bruxelles
Venue:
ACM Transactions on the Web (TWEB)
Year:
2010

Abstract

Inferring an appropriate DTD or XML Schema Definition (XSD) for a given collection of XML documents essentially reduces to learning deterministic regular expressions from sets of positive example words. Unfortunately, as we will show, no algorithm can learn the complete class of deterministic regular expressions from positive examples only. The regular expressions occurring in practical DTDs and XSDs, however, are such that every alphabet symbol occurs only a small number of times. In practice it therefore suffices to learn the subclass of deterministic regular expressions in which each alphabet symbol occurs at most k times, for some small k. We refer to such expressions as k-occurrence regular expressions (k-OREs for short). Motivated by this observation, we provide a probabilistic algorithm that learns k-OREs for increasing values of k, and selects the deterministic one that best describes the sample based on a Minimum Description Length argument. The effectiveness of the method is empirically validated on both real-world and synthetic data. Furthermore, the method is shown to be conservative over the simpler classes of expressions considered in previous work.
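
To make the k-occurrence notion concrete, the following minimal sketch computes the smallest k for which a DTD-style content model is a k-ORE, simply by counting how often each element name appears in the expression. The function names `occurrence_counts` and `min_k`, the tokenizing pattern, and the example content models are assumptions made for illustration; they do not come from the paper. The sketch does not test determinism and is not the learning algorithm described above.

```python
import re
from collections import Counter


def occurrence_counts(expr: str) -> Counter:
    """Count how often each element name occurs in a content model.

    Assumption: element names start with a letter or underscore and
    continue with letters, digits, '_', '.', or '-'. Operators such as
    ',', '|', '?', '*', '+', and parentheses are ignored.
    """
    return Counter(re.findall(r"[A-Za-z_][\w.\-]*", expr))


def min_k(expr: str) -> int:
    """Return the smallest k such that every symbol occurs at most k times."""
    counts = occurrence_counts(expr)
    return max(counts.values(), default=0)


# Every element name occurs once, so this is a 1-ORE.
print(min_k("(title, author+, (section | appendix)*)"))

# The symbol 'a' occurs twice, so k must be at least 2.
print(min_k("(a, b?, (c | a)*)"))
```

Counting occurrences only establishes the k bound; whether an expression is also deterministic has to be checked separately.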