We consider the problem of inferring a concise Document Type Definition (DTD) for a given set of XML documents, a problem that essentially reduces to learning concise regular expressions from positive example strings. We identify two classes of such expressions: single occurrence regular expressions (SOREs) and chain regular expressions (CHAREs). Both classes capture the vast majority of the regular expressions occurring in practical DTDs and are succinct by definition. We present the algorithm iDTD (infer DTD), which learns SOREs from strings by first inferring an automaton using known techniques and then translating that automaton into a corresponding SORE, repairing the automaton when no equivalent SORE can be found. In the process, we introduce a novel technique for rewriting automata into regular expressions, which is of independent interest. We show that iDTD outperforms existing systems in accuracy, conciseness, and speed. When only a very small amount of XML data is available, for instance when the data is generated by Web service requests or by answers to queries, iDTD produces regular expressions that are too specific. We therefore introduce a novel learning algorithm, CRX, that directly infers CHAREs (which form a subclass of SOREs) without going through an automaton representation. We show that CRX performs very well within its target class on very small data sets. Finally, we discuss incremental computation, noise, numerical predicates, and the generation of XML Schemas.
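To make the first step of this pipeline concrete, the sketch below shows one common way to build an automaton from positive example strings: give every element name its own state and record which names were seen first, last, and immediately after one another. The abstract does not say which known technique iDTD uses, so this is an illustrative assumption rather than the authors' method. The names `infer_soa`, `accepts`, `is_single_occurrence`, `SOURCE`, and `SINK` are hypothetical, and the single-occurrence check assumes the informal reading that each element name appears at most once in the expression.

```python
import re
from typing import Iterable, Sequence, Set, Tuple

Edge = Tuple[str, str]

# Hypothetical marker names for the artificial start and end states.
SOURCE, SINK = "<src>", "<snk>"


def infer_soa(samples: Iterable[Sequence[str]]) -> Set[Edge]:
    """Collect automaton edges from positive example strings.

    Each element name is its own state. An edge (a, b) is added when b
    was observed immediately after a in at least one sample.
    """
    edges: Set[Edge] = set()
    for word in samples:
        previous = SOURCE
        for symbol in word:
            edges.add((previous, symbol))
            previous = symbol
        edges.add((previous, SINK))
    return edges


def accepts(edges: Set[Edge], word: Sequence[str]) -> bool:
    """Check whether the automaton described by `edges` accepts `word`."""
    previous = SOURCE
    for symbol in word:
        if (previous, symbol) not in edges:
            return False
        previous = symbol
    return (previous, SINK) in edges


def is_single_occurrence(content_model: str) -> bool:
    """Return True if no element name appears twice in the expression.

    Assumption: names are plain identifiers, and operators such as
    ',', '|', '?', '+', '*' and parentheses are ignored.
    """
    names = re.findall(r"[A-Za-z_][\w.\-]*", content_model)
    return len(names) == len(set(names))


if __name__ == "__main__":
    # Child-element sequences observed under some parent element.
    samples = [["title", "author", "author"], ["title", "author"]]
    soa = infer_soa(samples)
    print(sorted(soa))
    print(accepts(soa, ["title", "author", "author", "author"]))
    print(is_single_occurrence("title, author+, (year | date)?"))
```

The automaton generalizes beyond the samples: because `author` was seen following itself, any number of repeated `author` elements is accepted. Translating such an automaton into an equivalent SORE, and repairing it when none exists, is the part contributed by iDTD and is not attempted here.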