Inferring an appropriate DTD or XML Schema Definition (XSD) for a given collection of XML documents essentially reduces to learning deterministic regular expressions from sets of positive example words. Unfortunately, as we will show, no algorithm can learn the complete class of deterministic regular expressions from positive examples only. The regular expressions occurring in practical DTDs and XSDs, however, are such that every alphabet symbol occurs only a small number of times. In practice it therefore suffices to learn the subclass of regular expressions in which each alphabet symbol occurs at most k times, for some small k. We refer to such expressions as k-occurrence regular expressions (k-OREs for short). Motivated by this observation, we provide a probabilistic algorithm that learns k-OREs for increasing values of k and selects the one that best describes the sample based on a Minimum Description Length argument. The effectiveness of the method is empirically validated on both real-world and synthetic data. Furthermore, the method is shown to be conservative over the simpler classes of expressions considered in previous work.
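To make the k-ORE definition concrete, the following is a minimal sketch of how the occurrence bound of an expression could be computed. It is not the learning algorithm from the paper. The function names (`occurrence_counts`, `smallest_k`, `is_k_ore`) and the textual syntax (element names as symbols, with `,`, `|`, `?`, `*`, `+` and parentheses as operators) are assumptions made for illustration only.

```python
import re
from collections import Counter

# Assumption: alphabet symbols are written as XML-style element names;
# everything else (",", "|", "?", "*", "+", parentheses, spaces) is an operator.
SYMBOL = re.compile(r"[A-Za-z_][A-Za-z0-9_.-]*")


def occurrence_counts(expr: str) -> Counter:
    """Count how often each alphabet symbol occurs in the expression text."""
    return Counter(SYMBOL.findall(expr))


def smallest_k(expr: str) -> int:
    """Return the smallest k such that every symbol occurs at most k times."""
    counts = occurrence_counts(expr)
    return max(counts.values(), default=0)


def is_k_ore(expr: str, k: int) -> bool:
    """Check the occurrence bound only; determinism is NOT checked here."""
    return smallest_k(expr) <= k


if __name__ == "__main__":
    for expr in ["title, author+, (year | date)?", "a, (a | b)*"]:
        print(expr, "->", smallest_k(expr))
```

Under these assumptions, `title, author+, (year | date)?` uses every element name once and so satisfies the bound for k = 1, whereas `a, (a | b)*` mentions `a` twice and only satisfies it for k = 2. The sketch checks the occurrence bound alone; whether an expression is also deterministic is a separate question that it does not address.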