Learning deterministic regular expressions for the inference of schemas from XML data

Authors:
Geert Jan Bex;Wouter Gelade;Frank Neven;Stijn Vansummeren
Affiliations:
Hasselt University/Transnational University of Limburg, Diepenbeek, Belgium;Hasselt University/Transnational University of Limburg, Diepenbeek, Belgium;Hasselt University/Transnational University of Limburg, Diepenbeek, Belgium;Hasselt University/Transnational University of Limburg, Diepenbeek, Belgium
Venue:
Proceedings of the 17th international conference on World Wide Web
Year:
2008

Abstract

Inferring an appropriate DTD or XML Schema Definition (XSD) for a given collection of XML documents essentially reduces to learning deterministic regular expressions from sets of positive example words. Unfortunately, there is no algorithm capable of learning the complete class of deterministic regular expressions from positive examples only, as we will show. The regular expressions occurring in practical DTDs and XSDs, however, are such that every alphabet symbol occurs only a small number of times. As such, in practice it suffices to learn the subclass of regular expressions in which each alphabet symbol occurs at most k times, for some small k. We refer to such expressions as k-occurrence regular expressions (k-OREs for short). Motivated by this observation, we provide a probabilistic algorithm that learns k-OREs for increasing values of k, and selects the one that best describes the sample based on a Minimum Description Length argument. The effectiveness of the method is empirically validated on both real-world and synthetic data. Furthermore, the method is shown to be conservative over the simpler classes of expressions considered in previous work.
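
To make the k-ORE definition concrete, the following minimal Python sketch computes the smallest k for which a given expression is a k-ORE, by counting how often each alphabet symbol (element name) occurs in it. This is only an illustration of the definition, not the learning algorithm of the paper. The function name `occurrence_bound`, the DTD-like string syntax, and the tokenization pattern are assumptions made for this example.

```python
import re
from collections import Counter


def occurrence_bound(content_model: str) -> int:
    """Return the smallest k such that the expression is a k-ORE.

    Assumption (hypothetical input format): the expression is written in a
    DTD-like syntax where element names are identifiers and every other
    character (parentheses, ',', '|', '*', '+', '?') is an operator.
    """
    # Extract every occurrence of an element name, ignoring operators.
    symbols = re.findall(r"[A-Za-z_][A-Za-z0-9_.-]*", content_model)
    # Count the occurrences of each distinct symbol.
    counts = Counter(symbols)
    # k is the highest occurrence count; an empty expression yields 0.
    return max(counts.values(), default=0)


# Each symbol occurs once, so this expression is a 1-ORE.
print(occurrence_bound("title, author+, year?"))
# The symbol 'a' occurs twice, so the smallest bound is k = 2.
print(occurrence_bound("(a | b)*, a, c?"))
```

Under these assumptions the first expression has bound 1 and the second has bound 2, matching the observation in the abstract that a small k is enough when every symbol occurs only a few times.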