Inferring an appropriate DTD or XML Schema Definition (XSD) for a given collection of XML documents essentially reduces to learning deterministic regular expressions from sets of positive example words. Unfortunately, as we will show, no algorithm can learn the complete class of deterministic regular expressions from positive examples only. In the regular expressions that occur in practical DTDs and XSDs, however, every alphabet symbol occurs only a small number of times. In practice, it therefore suffices to learn the subclass of deterministic regular expressions in which each alphabet symbol occurs at most k times, for some small k. We refer to such expressions as k-occurrence regular expressions (k-OREs for short). Motivated by this observation, we provide a probabilistic algorithm that learns k-OREs for increasing values of k and selects the deterministic one that best describes the sample according to a Minimum Description Length argument. The effectiveness of the method is validated empirically on both real-world and synthetic data. Furthermore, the method is shown to be conservative over the simpler classes of expressions considered in previous work.
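
To make the occurrence bound concrete, the following is a minimal sketch that counts how often each symbol appears in a content model and reports the smallest k satisfying the bound. It is an illustration only, not the learning algorithm described above. The function names (`occurrence_counts`, `smallest_k`, `within_occurrence_bound`) and the DTD-style input syntax are assumptions made for this example, and the sketch checks only the occurrence bound, not whether the expression is deterministic.

```python
import re
from collections import Counter

# Assumption: symbols are XML-like element names; every other character
# (parentheses, ',', '|', '*', '+', '?', whitespace) is treated as an operator.
NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_.\-]*")


def occurrence_counts(expr: str) -> Counter:
    """Count how many times each symbol occurs in the expression."""
    return Counter(NAME.findall(expr))


def smallest_k(expr: str) -> int:
    """Return the largest per-symbol occurrence count, i.e. the smallest k
    for which every symbol occurs at most k times (0 if there are no symbols)."""
    return max(occurrence_counts(expr).values(), default=0)


def within_occurrence_bound(expr: str, k: int) -> bool:
    """Check only the occurrence bound; determinism is NOT checked here."""
    return smallest_k(expr) <= k


if __name__ == "__main__":
    for model in ["(title, author+, (chapter | appendix)*)", "((a, b)*, a)"]:
        print(model, "->", smallest_k(model))
```

In the first content model every element name occurs once, so the bound holds with k = 1. In the second, `a` occurs twice, so the bound holds only for k of 2 or more.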