Finding optimal probabilistic generators for XML collections

Authors:
Serge Abiteboul;Yael Amsterdamer;Daniel Deutch;Tova Milo;Pierre Senellart
Affiliations:
INRIA Saclay, ENS Cachan;INRIA Saclay, Tel Aviv University;Ben Gurion University, INRIA Saclay, ENS Cachan;Tel Aviv University;Institut Télécom / Télécom ParisTech, CNRS LTCI
Venue:
Proceedings of the 15th International Conference on Database Theory
Year:
2012

Abstract

Given a corpus of XML documents and its schema, we study the problem of finding an optimal generative probabilistic model, where optimality means maximizing the likelihood that the model generates this particular corpus. Focusing first on the structure of documents, we present an efficient algorithm for finding the best generative probabilistic model in the absence of constraints. We then study the problem in the presence of integrity constraints, namely key, inclusion, and domain constraints, and consider two kinds of generators for this case. The first is a continuation-test generator that, while generating a document, tests schema satisfiability; these tests prevent the generation of a document that violates the constraints but, as we will see, they are computationally expensive. The second is a restart generator that may generate an invalid document and, when it does, restarts and tries again. Finally, we consider the injection of data values into the structure to obtain a full XML document, and study different approaches for generating these values.
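
The restart generator described in the abstract can be illustrated with a minimal sketch. This is a toy example under stated assumptions, not the paper's algorithm: the two-level "library"/"book" document shape, the probabilities in CHILD_COUNT_PROBS, the value domain ISBN_DOMAIN, and all function names are hypothetical and chosen only for illustration. The sketch samples a document while ignoring a key constraint, checks the constraint afterwards, and restarts from scratch on a violation.

```python
import random

# Hypothetical toy generator (not from the paper): a "library" element
# has 1 to 3 "book" children, and each book carries an isbn value drawn
# uniformly from a small domain.
CHILD_COUNT_PROBS = {1: 0.2, 2: 0.5, 3: 0.3}
ISBN_DOMAIN = ["A", "B", "C", "D"]


def generate_unconstrained(rng):
    """Sample one document, ignoring the key constraint."""
    counts = list(CHILD_COUNT_PROBS)
    weights = list(CHILD_COUNT_PROBS.values())
    n = rng.choices(counts, weights=weights, k=1)[0]
    books = [("book", rng.choice(ISBN_DOMAIN)) for _ in range(n)]
    return ("library", books)


def satisfies_key(doc):
    """Key constraint (assumed): no two books share the same isbn."""
    isbns = [isbn for _, isbn in doc[1]]
    return len(isbns) == len(set(isbns))


def restart_generate(rng, max_restarts=1000):
    """Generate, validate, and restart from scratch on a violation.

    Returns the valid document and the number of attempts it took.
    """
    for attempt in range(1, max_restarts + 1):
        doc = generate_unconstrained(rng)
        if satisfies_key(doc):
            return doc, attempt
    raise RuntimeError("no valid document within max_restarts")


if __name__ == "__main__":
    doc, attempts = restart_generate(random.Random(0))
    print(doc, "after", attempts, "attempt(s)")
```

In this sketch each attempt is cheap, but the number of attempts grows as valid documents become rarer. A continuation-test generator would instead check, before each choice, whether the partial document can still be completed into a valid one, which avoids restarts at the cost of the satisfiability tests mentioned above.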