XML is the de facto standard format for data exchange on the Web. While it is fairly simple to generate XML data, it is a complex task to design a schema and then guarantee that the generated data is valid according to that schema. As a consequence, much XML data does not have a schema or is not accompanied by its schema. To gain the benefits of having a schema (efficient querying and storage of XML data, semantic verification, data integration, etc.), this schema must be extracted. In this paper we present an automatic technique, XStruct, for XML Schema extraction. Based on ideas of [5], XStruct extracts a schema for XML data by applying several heuristics to deduce regular expressions that are 1-unambiguous and describe each element's contents correctly but generalized to a reasonable degree. Our approach offers several advantages over known techniques: XStruct scales to very large documents (beyond 1 GB) in both time and memory consumption; it is able to extract a general, complete, correct, minimal, and understandable schema for multiple documents; and it detects datatypes and attributes. Experiments confirm these features and properties.
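To make the idea of deducing a content model from instance data concrete, the following is a minimal sketch. It is not the XStruct algorithm and does not reproduce its heuristics: it assumes that each child element name is listed once, in first-seen order, with a quantifier derived from the minimum and maximum number of occurrences observed per parent instance. The names `infer_content_models` and `quantifier` are hypothetical and chosen only for illustration. Because every child name appears at most once in each expression, the result is trivially 1-unambiguous, but it will not correctly describe data in which sibling order varies or names interleave.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def quantifier(min_count, max_count):
    """Map observed occurrence bounds to a regex-style quantifier."""
    if min_count == 0:
        return "?" if max_count == 1 else "*"
    return "" if max_count == 1 else "+"


def infer_content_models(xml_strings):
    """Infer a naive content model per element name from several documents.

    Illustrative assumption: children are emitted in first-seen order and
    each child name occurs once in the resulting expression.
    """
    # parent tag -> child tag -> occurrence counts, one entry per parent
    # instance that contained the child at least once
    counts = defaultdict(lambda: defaultdict(list))
    instances = defaultdict(int)   # parent tag -> number of instances seen
    order = defaultdict(list)      # parent tag -> child tags, first-seen order

    for text in xml_strings:
        root = ET.fromstring(text)
        for elem in root.iter():
            instances[elem.tag] += 1
            per_instance = defaultdict(int)
            for child in elem:
                per_instance[child.tag] += 1
                if child.tag not in order[elem.tag]:
                    order[elem.tag].append(child.tag)
            for tag, n in per_instance.items():
                counts[elem.tag][tag].append(n)

    models = {}
    for tag, total in instances.items():
        parts = []
        for child in order[tag]:
            seen = counts[tag][child]
            # A child missing from any parent instance is optional.
            min_count = min(seen) if len(seen) == total else 0
            parts.append(child + quantifier(min_count, max(seen)))
        models[tag] = ", ".join(parts) if parts else "(no child elements)"
    return models
```

Called on a list of XML document strings, the function returns a dictionary from element name to an inferred expression such as `title, author*`. Reading whole documents into memory with `ET.fromstring` keeps the sketch short; it would not scale to the document sizes mentioned above, where a streaming parser would be the more plausible choice.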