XML is rapidly emerging as the new standard for data representation and exchange on the Web. Unlike HTML tags, XML tags describe the semantics of the data rather than how it is to be displayed. In addition, an XML document can be accompanied by a Document Type Definition (DTD), which plays the role of a schema for an XML data collection. DTDs contain valuable information about document structure and thus play a crucial role in the efficient storage of XML data, as well as in the effective formulation and optimization of XML queries. Despite their importance, however, DTDs are not mandatory, and documents in XML databases frequently lack accompanying DTDs. In this paper, we propose XTRACT, a novel system for inferring a DTD schema for a database of XML documents. Because the DTD syntax incorporates the full expressive power of regular expressions, naive approaches typically fail to produce concise and intuitive DTDs. Instead, the XTRACT inference algorithms employ a sequence of steps: (1) finding patterns in the input sequences and replacing them with regular expressions to generate “general” candidate DTDs, (2) factoring candidate DTDs using adaptations of algorithms from the logic optimization literature, and (3) applying the Minimum Description Length (MDL) principle to find the best DTD among the candidates. Our experiments with real-life and synthetic DTDs demonstrate the effectiveness of XTRACT's approach in inferring concise and semantically meaningful DTD schemas for XML databases.
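To make step (3) concrete, the following is a minimal sketch of MDL-based selection among candidate content models. It is not XTRACT's actual encoding scheme, which the abstract does not specify. It assumes a simplified two-part code: the cost of writing the candidate regular expression, plus the cost of identifying each input sequence among the strings of the same length that the candidate accepts. The function names (`schema_cost`, `data_cost`, `mdl_cost`, `select_best`), the single-character element names, and the sample sequences are all hypothetical.

```python
import math
import re
from itertools import product

# Assumed set of regex metacharacters a candidate may use.
METACHARS = "()|*+?"


def schema_cost(regex, alphabet):
    """Bits to write the regex, one symbol at a time."""
    symbols = len(alphabet) + len(METACHARS)
    return len(regex) * math.log2(symbols)


def data_cost(regex, sequences, alphabet):
    """Bits to encode every sequence given the regex.

    Simplified stand-in encoding: state the sequence length, then pick
    the sequence among all accepted strings of that length (counted by
    brute force, so only suitable for tiny alphabets and short inputs).
    Assumes `sequences` is non-empty.
    """
    pattern = re.compile(regex)
    max_len = max(len(s) for s in sequences)
    total = 0.0
    for seq in sequences:
        if not pattern.fullmatch(seq):
            return math.inf  # candidate does not cover the data
        accepted = sum(
            1
            for chars in product(alphabet, repeat=len(seq))
            if pattern.fullmatch("".join(chars))
        )
        total += math.log2(max_len + 1) + math.log2(accepted)
    return total


def mdl_cost(regex, sequences, alphabet):
    """Total description length: schema bits plus data bits."""
    return schema_cost(regex, alphabet) + data_cost(regex, sequences, alphabet)


def select_best(candidates, sequences, alphabet):
    """Return the candidate with the smallest total description length."""
    return min(candidates, key=lambda r: mdl_cost(r, sequences, alphabet))


if __name__ == "__main__":
    # Child-element sequences observed under one parent element,
    # with elements abbreviated to single letters.
    sequences = ["ab", "abab", "ababab"]
    candidates = ["ab|abab|ababab", "(ab)*", "(a|b)*"]
    for regex in candidates:
        print(regex, round(mdl_cost(regex, sequences, "ab"), 2))
    print("best:", select_best(candidates, sequences, "ab"))
```

Under these assumptions the enumeration `ab|abab|ababab` is expensive to write down, and the overly general `(a|b)*` is cheap to write but expensive for encoding the data, so the total cost favors `(ab)*`. This mirrors the trade-off between conciseness and precision that the MDL step is meant to resolve.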